library(e1071) # to understand skewness
library(dplyr)
library(stringr) # Used to rename the columns by removing the word team from the column header
library(VIM) # To understand NAs
library(caret)
library(mice) # Imputation 1
library(missForest) # Imputation 2
library(MASS) # to use for robust Linear Regression.
# browse to the data
moneyball = read.csv('/Users/legs_jorge/Documents/Data Science Projects/MSDS_Northwestern/MSDS 411/Unit 01 Moneyball Baseball Problem/Data/moneyball.csv', header = T)
colnames(moneyball) <- str_replace_all(colnames(moneyball),"TEAM_","") %>%
tolower() # Fixing column names
The moneyball dataset has inspired many companies, teams, and organizations to understand and use the data they generate and gather. This project highlights several pitfalls that those same organizations fall into when they skip the due diligence of preparing the data before modeling.
This paper will focus on:
1. Data Exploration
2. Data Transformation
3. Model Building
4. How to select the best model
R gives us many ways to understand the distribution of missing values (NAs) within the data. Let’s first calculate, for each column, the percentage of missing values relative to the total number of observations.
NAPerc <-
sapply(moneyball, function(x)
(sum(is.na(x)) / length(x)) * 100) %>%
data.frame()
NAPerc$Column <- rownames(NAPerc)
colnames(NAPerc) <- c("NA_Perc", "Col_Name")
# Trying to understand the percentage of NAs per Column
NA_col <- subset(NAPerc, NA_Perc > 0) %>% arrange(desc(NA_Perc))
NA_col
Let’s look at the pattern of missing data to get more insight. It’s clear that batting_hbp is going to be a problematic column, with 92% of its data missing. Before we start imputing or deleting variables, let’s try to understand why we have missing data.
Let’s use the mice package to help us understand how the NAs behave in the data. mice provides a handy function called md.pattern that shows the pattern of missing data. Hopefully, by looking at the pattern, we can get an idea of why the data could be missing.
md.pattern(moneyball) %>% data.frame()
index target_wins batting_h batting_2b batting_3b batting_hr batting_bb
191 1 1 1 1 1 1 1
1295 1 1 1 1 1 1 1
349 1 1 1 1 1 1 1
18 1 1 1 1 1 1 1
53 1 1 1 1 1 1 1
190 1 1 1 1 1 1 1
102 1 1 1 1 1 1 1
78 1 1 1 1 1 1 1
0 0 0 0 0 0 0
pitching_h pitching_hr pitching_bb fielding_e batting_so pitching_so
191 1 1 1 1 1 1
1295 1 1 1 1 1 1
349 1 1 1 1 1 1
18 1 1 1 1 1 1
53 1 1 1 1 1 1
190 1 1 1 1 1 1
102 1 1 1 1 0 0
78 1 1 1 1 1 1
0 0 0 0 102 102
baserun_sb fielding_dp baserun_cs batting_hbp V18
191 1 1 1 1 0
1295 1 1 1 0 1
349 1 1 0 0 2
18 1 0 1 0 2
53 0 1 0 0 3
190 1 0 0 0 3
102 1 1 0 0 4
78 0 0 0 0 4
131 286 772 2085 3478
The row names of the output (the unlabeled first column) show the number of observations that share each missing-data pattern; a 1 marks an observed variable and a 0 a missing one. There are 191 observations with no missing values, and there are 1295 observations that are complete except for the variable batting_hbp. The rightmost column shows the number of variables missing in each pattern. For example, the first row has no missing values, so its entry is 0. The last row counts the number of missing values for each variable. For example, the variable pitching_bb contains no missing values and the variable batting_so contains 102 missing values. This table can be helpful when you decide to drop observations whose number of missing variables exceeds a preset threshold.
After careful analysis, the decision is to keep batting_hbp: I want to transform it into a binary variable, and I will hold it out of the data until all the imputation is done.
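The row-level threshold mentioned above can be sketched in base R. This is only an illustration on made-up data: the `toy` data frame and the `max_missing` threshold are assumptions, not values used in this analysis.

```r
# Hypothetical toy data with scattered NAs (not the moneyball data)
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, NA, 6, 8),
  c = c(9, NA, 11, 12)
)

# Count the missing values in each row
na_per_row <- rowSums(is.na(toy))

# Keep only rows with at most `max_missing` missing values (assumed threshold)
max_missing <- 1
toy_kept <- toy[na_per_row <= max_missing, ]
toy_kept
```

The same idea applies column-wise with colSums(is.na(toy)) when the question is which variables to drop rather than which observations.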
batting_hbp_bi <- if_else(is.na(moneyball$batting_hbp),0,1)
batting_hbp <- moneyball$batting_hbp
moneyball_trans <- subset(moneyball, select = -c(batting_hbp))
Let’s impute and treat the data for missing values before testing it for multicollinearity.
The missForest package will be used to help us with this task. missForest is an implementation of the random forest algorithm for imputation. It’s a nonparametric imputation method applicable to various variable types. A great resource to understand this technique is found here.
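The call that produces moneyball_MF is not shown in the text. As a hedged sketch of what a missForest imputation generally looks like, the example below imputes a small made-up data frame; the `toy` data and the seed are assumptions, and the real analysis presumably ran something similar on moneyball_trans.

```r
# Assumes the missForest package is installed; skipped otherwise
if (requireNamespace("missForest", quietly = TRUE)) {
  set.seed(411)  # arbitrary seed, for reproducibility only

  # Hypothetical toy data standing in for moneyball_trans
  toy <- data.frame(x = rnorm(50), y = rnorm(50), z = rnorm(50))
  toy$y[c(3, 17, 29)] <- NA

  imp <- missForest::missForest(toy)
  toy_imputed <- imp$ximp    # the completed data frame
  print(imp$OOBerror)        # out-of-bag estimate of the imputation error
}
```

The `ximp` element holds the imputed data, which is the piece that would be carried forward into the rest of the analysis.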
Let’s add batting_hbp back into the data.
moneyball_MF$batting_hbp <- if_else(is.na(batting_hbp),0,as.numeric(batting_hbp))
moneyball_MF$batting_hbp_bi <- batting_hbp_bi
Outliers can distort our model by unduly influencing its fit. Creating boxplots will aid in identifying those outliers. We can also use the Cleveland dotplot to understand the outliers better. This technique plots the row number against the actual value to quickly point out any patterns of outliers. This plot will easily allow us to check the raw data for errors such as typos made during the data collection phase. Points on the far right side, or on the far left side, are observed values that are considerably larger, or smaller, than the majority of the observations, and require further investigation. When we use this chart together with the boxplot and histogram, we can easily identify where in the data we’re seeing outliers.
par(mfrow = c(1, 3))
# Loop over columns 2 to 17 (column 1 is the index)
for (i in 2:17) {
col_name <- colnames(moneyball_MF)[i]
out.lier <- boxplot.stats(moneyball_MF[, i])$out
# Predictor on the x-axis, target on the y-axis, so the data match the axis labels; outliers in red
plot(moneyball_MF[, i], moneyball_MF$target_wins, col = ifelse(moneyball_MF[, i] %in% out.lier, "red", "blue"), xlab = col_name, ylab = "Target Wins", main = paste("Scatter Plot of", col_name))
boxplot(moneyball_MF[, i], col = "#A71930", main = paste("Boxplot of", col_name))
title(sub = paste0("Number of Outliers = ", length(out.lier)))
hist(
moneyball_MF[, i],
col = "#A71930",
xlab = col_name,
main = paste("Histogram of", col_name)
)
}
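The loop above draws a scatter plot, a boxplot, and a histogram, but not the Cleveland dotplot described in the text. A minimal base R sketch of that plot follows, using made-up data with one planted outlier; `vals` is a hypothetical stand-in for a single column of the data.

```r
set.seed(1)
# Hypothetical column: 100 typical values plus one planted outlier
vals <- c(rnorm(100, mean = 50, sd = 5), 120)

# Values outside the boxplot whiskers
out_vals <- boxplot.stats(vals)$out

# Cleveland-style dotplot: observed value on x, row number on y
plot(
  vals, seq_along(vals),
  col = ifelse(vals %in% out_vals, "red", "blue"),
  xlab = "Observed value", ylab = "Row number",
  main = "Cleveland dotplot (toy data)"
)
```

Points isolated on the far right or far left stand out immediately, and their row numbers can be read off the y-axis for follow-up. Base R also provides dotchart() for the same purpose.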
It looks like the outliers are going to be a problem for this model. Multiple techniques will be used to remediate this issue.
Now that step one is done, let’s take a look at step 2.
From the histograms above, we clearly see that most of the variables are not normally distributed, although a few roughly follow a normal distribution. Let’s use QQ-plots to check each column for normality, adding a histogram and a skewness value.
- If skewness is less than −1 or greater than +1, the distribution is highly skewed.
- If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed.
- If skewness is between −½ and +½, the distribution is approximately symmetric.
par(mfrow = c(2, 2))
# One QQ-plot and one histogram per column, for columns 2 to 18
for (i in 2:18) {
col_name <- colnames(moneyball_MF)[i]
qqnorm(moneyball_MF[, i], main = paste("QQ-Plot of", col_name))
qqline(moneyball_MF[, i], col = 2)
hist(
moneyball_MF[, i],
col = "#A71930",
xlab = col_name,
main = paste0("Skewness = ", round(skewness(moneyball_MF[, i]), 3))
)
}
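The skewness rule of thumb listed above can also be applied programmatically. The sketch below is an assumption-laden illustration: `skew_g1` is a plain moment-based estimator (e1071::skewness uses a slightly different small-sample variant by default), and `classify_skew` is a hypothetical helper name.

```r
# Plain moment-based sample skewness
skew_g1 <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^1.5
}

# Map a skewness value onto the three categories from the rule of thumb
classify_skew <- function(s) {
  a <- abs(s)
  if (a > 1) {
    "highly skewed"
  } else if (a >= 0.5) {
    "moderately skewed"
  } else {
    "approximately symmetric"
  }
}

classify_skew(skew_g1(1:5))               # symmetric toy data
classify_skew(skew_g1(c(1, 1, 1, 1, 20))) # one large value pulls the tail right
```

Applied with sapply() over the numeric columns, this gives a quick list of the variables that are candidates for transformation.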
We will need to try some transformations to correct for skewness, with Box-Cox being the first choice. QQ-plots are a great way to quickly gauge the normality of the variables.
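For reference, the Box-Cox transformation itself is simple to write down. The following is a minimal sketch, not the caret implementation: `box_cox` is a hypothetical helper, and the lambda value is chosen arbitrarily here, whereas in practice it is estimated from the data.

```r
# Box-Cox transform: (y^lambda - 1) / lambda, or log(y) when lambda is 0.
# It is only defined for strictly positive values.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

set.seed(3)
y <- rexp(500)            # right-skewed toy data
y_t <- box_cox(y, 0.25)   # lambda = 0.25 is an assumed value for illustration

par(mfrow = c(1, 2))
hist(y, main = "Before Box-Cox")
hist(y_t, main = "After Box-Cox (lambda = 0.25)")
```

A lambda of 1 leaves the shape of the data unchanged, while values closer to 0 compress the right tail more strongly.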
Let’s create a series of correlation matrices to understand how each independent variable interacts with the dependent variable. These correlation matrices will help us spot any violation of the assumptions needed to develop a robust OLS model, in particular multicollinearity among the predictors. The caret package can help the user find highly correlated pairs and even suggest which variable to remove.
The caret package offers findCorrelation(), which takes the correlation matrix as input and finds the fields causing multicollinearity based on a threshold, the cutoff parameter (0.9 by default). It returns a vector of column indices that would need to be removed from our dataset to reduce the pairwise correlations.
colnames(moneyball_MF)[findCorrelation(cor(moneyball_MF))]
[1] "batting_hr" "batting_hbp"
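To see what findCorrelation() is reacting to, it can help to list the offending pairs directly. The following is a minimal base R sketch on made-up data; `high_cor_pairs` and `toy` are hypothetical names, and the sketch only lists pairs above the cutoff rather than reproducing caret's rule for choosing which member of each pair to drop.

```r
# List every pair of columns whose absolute correlation exceeds the cutoff
high_cor_pairs <- function(df, cutoff = 0.9) {
  cm <- cor(df)
  # Upper triangle only, so each pair is reported once
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r = cm[idx],
    stringsAsFactors = FALSE
  )
}

set.seed(42)
x <- rnorm(100)
# y is almost a multiple of x; z is unrelated
toy <- data.frame(x = x, y = 2 * x + rnorm(100, sd = 0.1), z = rnorm(100))

found <- high_cor_pairs(toy, cutoff = 0.9)
found
```

Seeing the pairs makes it easier to decide, on subject-matter grounds, which variable of each pair is worth keeping.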
Per caret’s suggestion, we need to remove two variables in order to deal with the multicollinearity issue, batting_hr and batting_hbp. We will keep that in mind for when we start the data transformation phase. For now, let’s keep them since we need them for more feature engineering.
## Data Transformation
Let’s introduce new variables through transformation:
- batting_1B = batting_h - (batting_2b + batting_3b + batting_hr)
- free_bases_num = batting_hbp + batting_bb
- total_bases = batting_1B + 2 * batting_2b + 3 * batting_3b + 4 * batting_hr + batting_bb + batting_hbp + baserun_sb
- total_bases_allowed = pitching_bb + 4 * pitching_hr + pitching_h
- HR_over_OP = batting_hr - pitching_hr
- walks_over_OP = batting_bb - pitching_bb
- SO_over_OP = pitching_so - batting_so
moneyball_MF$batting_1B <- moneyball_MF$batting_h - (moneyball_MF$batting_2b + moneyball_MF$batting_3b + moneyball_MF$batting_hr)
moneyball_MF$free_bases_num <- if_else(is.na(moneyball_MF$batting_hbp),0,as.numeric(moneyball_MF$batting_hbp)) + moneyball_MF$batting_bb
moneyball_MF$total_bases <- moneyball_MF$batting_1B + 2 * moneyball_MF$batting_2b + 3 * moneyball_MF$batting_3b + 4 * moneyball_MF$batting_hr + moneyball_MF$batting_bb + if_else(is.na(moneyball_MF$batting_hbp),0,as.numeric(moneyball_MF$batting_hbp)) + moneyball_MF$baserun_sb
moneyball_MF$total_bases_allowed = moneyball_MF$pitching_bb + 4 * moneyball_MF$pitching_hr + moneyball_MF$pitching_h
moneyball_MF$HR_over_OP = moneyball_MF$batting_hr - moneyball_MF$pitching_hr
moneyball_MF$walks_over_OP = moneyball_MF$batting_bb - moneyball_MF$pitching_bb
moneyball_MF$SO_over_OP = moneyball_MF$pitching_so - moneyball_MF$batting_so
# Make a list of predictors and format them. This will make it easier to manually choose variables for the model.
pred_list <-
"index + target_wins + batting_h + batting_2b + batting_3b + batting_hr +
batting_bb + batting_so + baserun_sb + baserun_cs + pitching_h + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp + batting_hbp_bi +
batting_1B + free_bases_num + total_bases + total_bases_allowed + HR_over_OP + walks_over_OP + SO_over_OP"
# Keep the new variables in a vector for testing later, in case they don't prove to be of any value.
new_var <- c("batting_1B","free_bases_num","total_bases","total_bases_allowed","HR_over_OP","walks_over_OP","SO_over_OP")
Now that we have imputed the data and created new variables, let’s look at the correlation matrix to understand the correlation between the variables and target_wins. Remember when caret suggested deleting batting_hr and batting_hbp from our model? Let’s build a correlation matrix to understand why.
moneyball_MF <- subset(moneyball_MF, select = -c(batting_hbp))
cor(moneyball_MF)
index target_wins batting_h batting_2b
index 1.000000000 -0.021056435 -0.017920241 0.011183013
target_wins -0.021056435 1.000000000 0.388767521 0.289103645
batting_h -0.017920241 0.388767521 1.000000000 0.562849678
batting_2b 0.011183013 0.289103645 0.562849678 1.000000000
batting_3b -0.005814683 0.142608411 0.427696575 -0.107305824
batting_hr 0.051481047 0.176153200 -0.006544685 0.435397293
batting_bb -0.026567236 0.232559864 -0.072464013 0.255726103
batting_so 0.080101772 -0.036340851 -0.443167361 0.168544347
baserun_sb 0.028009343 0.124531876 0.117377219 -0.195729618
baserun_cs -0.033858755 0.011952527 -0.054061218 -0.400716260
pitching_h 0.017103148 -0.109937054 0.302693709 0.023692188
pitching_hr 0.050985897 0.189013735 0.072853119 0.454550818
pitching_bb -0.015287513 0.124174536 0.094193027 0.178054204
pitching_so 0.054885070 -0.079549477 -0.243368070 0.067229591
fielding_e -0.009233126 -0.176484759 0.264902478 -0.235150986
fielding_dp 0.007225231 -0.004802168 0.002545126 0.311431506
batting_hbp_bi 0.047332196 0.002610647 0.019594018 0.361922796
batting_1B -0.047074417 0.217430135 0.827584756 0.087009889
free_bases_num -0.019063695 0.228098279 -0.068377971 0.297591911
total_bases 0.023117000 0.481476330 0.629194622 0.706367022
total_bases_allowed 0.023268954 -0.059959123 0.314205398 0.119290484
HR_over_OP -0.000553440 -0.060991072 -0.322055891 -0.099453882
walks_over_OP -0.004745951 0.052184113 -0.162824365 0.011599182
SO_over_OP 0.020678412 -0.069498989 -0.048085266 -0.009574324
batting_3b batting_hr batting_bb batting_so baserun_sb
index -0.005814683 0.051481047 -0.02656724 0.08010177 0.02800934
target_wins 0.142608411 0.176153200 0.23255986 -0.03634085 0.12453188
batting_h 0.427696575 -0.006544685 -0.07246401 -0.44316736 0.11737722
batting_2b -0.107305824 0.435397293 0.25572610 0.16854435 -0.19572962
batting_3b 1.000000000 -0.635566946 -0.28723584 -0.66778072 0.55085401
batting_hr -0.635566946 1.000000000 0.51373481 0.71404750 -0.50180821
batting_bb -0.287235841 0.513734810 1.00000000 0.38059524 -0.27348943
batting_so -0.667780716 0.714047497 0.38059524 1.00000000 -0.27601969
baserun_sb 0.550854012 -0.501808209 -0.27348943 -0.27601969 1.00000000
baserun_cs 0.582277676 -0.716449484 -0.35478148 -0.39964841 0.76456596
pitching_h 0.194879411 -0.250145481 -0.44977762 -0.37103703 0.13838488
pitching_hr -0.567836679 0.969371396 0.45955207 0.65510882 -0.45378409
pitching_bb -0.002224148 0.136927564 0.48936126 0.04430509 0.06813425
pitching_so -0.260565688 0.186007336 -0.01626540 0.41798506 0.04230928
fielding_e 0.509778447 -0.587339098 -0.65597081 -0.58489244 0.53621429
fielding_dp -0.416934260 0.543299808 0.47186671 0.32031950 -0.56479348
batting_hbp_bi -0.265544426 0.392199209 0.10305838 0.39637912 -0.13679819
batting_1B 0.600399234 -0.497294855 -0.35312165 -0.74883066 0.31898468
free_bases_num -0.316009005 0.553966941 0.99101046 0.42472890 -0.28538081
total_bases 0.030511245 0.602463886 0.56903995 0.20369150 0.01888211
total_bases_allowed 0.092039617 -0.062551344 -0.30004852 -0.24212411 0.06481079
HR_over_OP -0.243354524 0.074559388 0.19441460 0.20374915 -0.17000250
walks_over_OP -0.231156161 0.266798215 0.27356493 0.26067271 -0.29757731
SO_over_OP 0.044315574 -0.149264697 -0.20651767 -0.03579320 0.18333956
baserun_cs pitching_h pitching_hr pitching_bb pitching_so
index -0.03385876 0.01710315 0.05098590 -0.015287513 0.054885070
target_wins 0.01195253 -0.10993705 0.18901373 0.124174536 -0.079549477
batting_h -0.05406122 0.30269371 0.07285312 0.094193027 -0.243368070
batting_2b -0.40071626 0.02369219 0.45455082 0.178054204 0.067229591
batting_3b 0.58227768 0.19487941 -0.56783668 -0.002224148 -0.260565688
batting_hr -0.71644948 -0.25014548 0.96937140 0.136927564 0.186007336
batting_bb -0.35478148 -0.44977762 0.45955207 0.489361263 -0.016265401
batting_so -0.39964841 -0.37103703 0.65510882 0.044305086 0.417985061
baserun_sb 0.76456596 0.13838488 -0.45378409 0.068134245 0.042309281
baserun_cs 1.00000000 0.05377149 -0.69693011 -0.084099713 -0.053742796
pitching_h 0.05377149 1.00000000 -0.14161276 0.320676162 0.267731177
pitching_hr -0.69693011 -0.14161276 1.00000000 0.221937505 0.205843409
pitching_bb -0.08409971 0.32067616 0.22193750 1.000000000 0.485143295
pitching_so -0.05374280 0.26773118 0.20584341 0.485143295 1.000000000
fielding_e 0.44615087 0.66775901 -0.49314447 -0.022837561 -0.024872279
fielding_dp -0.63128843 -0.23665566 0.50810200 0.133524785 0.019019632
batting_hbp_bi -0.24860955 -0.06445004 0.35794984 -0.016906833 0.133166935
batting_1B 0.29511882 0.40612014 -0.41549520 -0.022820326 -0.328235567
free_bases_num -0.37950680 -0.44800796 0.49652206 0.476195183 0.001936213
total_bases -0.31997018 -0.10543152 0.62785299 0.362193063 -0.027539432
total_bases_allowed -0.07470906 0.97499650 0.05669475 0.459579945 0.347350722
HR_over_OP -0.04357384 -0.42822141 -0.17264012 -0.351988418 -0.089804877
walks_over_OP -0.19578725 -0.71949139 0.12897043 -0.704942270 -0.548312806
SO_over_OP 0.13894686 0.47840952 -0.09823344 0.511731836 0.892910758
fielding_e fielding_dp batting_hbp_bi batting_1B
index -0.009233126 0.007225231 0.047332196 -0.04707442
target_wins -0.176484759 -0.004802168 0.002610647 0.21743014
batting_h 0.264902478 0.002545126 0.019594018 0.82758476
batting_2b -0.235150986 0.311431506 0.361922796 0.08700989
batting_3b 0.509778447 -0.416934260 -0.265544426 0.60039923
batting_hr -0.587339098 0.543299808 0.392199209 -0.49729485
batting_bb -0.655970815 0.471866710 0.103058382 -0.35312165
batting_so -0.584892437 0.320319496 0.396379123 -0.74883066
baserun_sb 0.536214293 -0.564793484 -0.136798191 0.31898468
baserun_cs 0.446150872 -0.631288427 -0.248609547 0.29511882
pitching_h 0.667759010 -0.236655659 -0.064450039 0.40612014
pitching_hr -0.493144466 0.508101997 0.357949841 -0.41549520
pitching_bb -0.022837561 0.133524785 -0.016906833 -0.02282033
pitching_so -0.024872279 0.019019632 0.133166935 -0.32823557
fielding_e 1.000000000 -0.554393215 -0.185315470 0.54781641
fielding_dp -0.554393215 1.000000000 0.113856635 -0.27499781
batting_hbp_bi -0.185315470 0.113856635 1.000000000 -0.23605172
batting_1B 0.547816415 -0.274997811 -0.236051718 1.00000000
free_bases_num -0.665319984 0.475662215 0.231848863 -0.37639588
total_bases -0.269071226 0.310545770 0.296371066 0.15968462
total_bases_allowed 0.557252830 -0.127317069 -0.003909755 0.31851323
HR_over_OP -0.353210656 0.115856978 0.119531251 -0.30736736
walks_over_OP -0.508313405 0.236499900 0.102464739 -0.26202481
SO_over_OP 0.262514082 -0.137828882 -0.049954800 0.01004295
free_bases_num total_bases total_bases_allowed HR_over_OP
index -0.019063695 0.02311700 0.023268954 -0.00055344
target_wins 0.228098279 0.48147633 -0.059959123 -0.06099107
batting_h -0.068377971 0.62919462 0.314205398 -0.32205589
batting_2b 0.297591911 0.70636702 0.119290484 -0.09945388
batting_3b -0.316009005 0.03051124 0.092039617 -0.24335452
batting_hr 0.553966941 0.60246389 -0.062551344 0.07455939
batting_bb 0.991010459 0.56903995 -0.300048525 0.19441460
batting_so 0.424728895 0.20369150 -0.242124109 0.20374915
baserun_sb -0.285380807 0.01888211 0.064810789 -0.17000250
baserun_cs -0.379506798 -0.31997018 -0.074709058 -0.04357384
pitching_h -0.448007961 -0.10543152 0.974996503 -0.42822141
pitching_hr 0.496522065 0.62785299 0.056694753 -0.17264012
pitching_bb 0.476195183 0.36219306 0.459579945 -0.35198842
pitching_so 0.001936213 -0.02753943 0.347350722 -0.08980488
fielding_e -0.665319984 -0.26907123 0.557252830 -0.35321066
fielding_dp 0.475662215 0.31054577 -0.127317069 0.11585698
batting_hbp_bi 0.231848863 0.29637107 -0.003909755 0.11953125
batting_1B -0.376395883 0.15968462 0.318513233 -0.30736736
free_bases_num 1.000000000 0.59553177 -0.293643548 0.20565630
total_bases 0.595531775 1.00000000 0.045056962 -0.13309286
total_bases_allowed -0.293643548 0.04505696 1.000000000 -0.48106409
HR_over_OP 0.205656303 -0.13309286 -0.481064087 1.00000000
walks_over_OP 0.280775130 0.06332353 -0.750919119 0.54633988
SO_over_OP -0.208367521 -0.13124558 0.502106467 -0.19977025
walks_over_OP SO_over_OP
index -0.004745951 0.020678412
target_wins 0.052184113 -0.069498989
batting_h -0.162824365 -0.048085266
batting_2b 0.011599182 -0.009574324
batting_3b -0.231156161 0.044315574
batting_hr 0.266798215 -0.149264697
batting_bb 0.273564933 -0.206517665
batting_so 0.260672708 -0.035793197
baserun_sb -0.297577306 0.183339558
baserun_cs -0.195787247 0.138946864
pitching_h -0.719491389 0.478409518
pitching_hr 0.128970430 -0.098233441
pitching_bb -0.704942270 0.511731836
pitching_so -0.548312806 0.892910758
fielding_e -0.508313405 0.262514082
fielding_dp 0.236499900 -0.137828882
batting_hbp_bi 0.102464739 -0.049954800
batting_1B -0.262024813 0.010042946
free_bases_num 0.280775130 -0.208367521
total_bases 0.063323533 -0.131245584
total_bases_allowed -0.750919119 0.502106467
HR_over_OP 0.546339879 -0.199770252
walks_over_OP 1.000000000 -0.732370782
SO_over_OP -0.732370782 1.000000000
Now that we created new variables, let’s see what caret has to say about which variables to remove.
colnames(moneyball_MF)[findCorrelation(cor(moneyball_MF), cutoff = 0.9)]
[1] "batting_hr" "free_bases_num" "pitching_h"
It suggests removing batting_hr together with free_bases_num and pitching_h. According to the correlation matrix, batting_hr has a correlation coefficient of 0.97 with pitching_hr, free_bases_num has a correlation coefficient of 0.99 with batting_bb, and pitching_h has a correlation coefficient of 0.97 with total_bases_allowed. All of these correlations are above the cutoff point of 0.9. Let’s remove those variables.
moneyball_MF <- subset(moneyball_MF, select = -c(batting_hr, free_bases_num, pitching_h))
pred_list <- "index + target_wins + batting_h + batting_2b + batting_3b + batting_bb + batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp_bi + batting_1B + total_bases + total_bases_allowed + HR_over_OP + walks_over_OP + SO_over_OP"
Let’s test a model to establish a baseline
str(moneyball_MF)
'data.frame': 2276 obs. of 21 variables:
$ index : num 1 2 3 4 5 6 7 8 11 12 ...
$ target_wins : num 39 70 86 70 82 75 80 85 86 76 ...
$ batting_h : num 1445 1339 1377 1387 1297 ...
$ batting_2b : num 194 219 232 209 186 200 179 171 197 213 ...
$ batting_3b : num 39 22 35 38 27 36 54 37 40 18 ...
$ batting_bb : num 143 685 602 451 472 443 525 456 447 441 ...
$ batting_so : num 842 1075 917 922 920 ...
$ baserun_sb : num 306 37 46 43 49 ...
$ baserun_cs : num 98 28 27 30 39 59 54 36 27 34 ...
$ pitching_hr : num 84 191 137 97 102 92 122 116 114 96 ...
$ pitching_bb : num 927 689 602 454 472 443 525 459 447 441 ...
$ pitching_so : num 5456 1082 917 928 920 ...
$ fielding_e : num 1011 193 175 164 138 ...
$ fielding_dp : num 114 155 153 156 168 ...
$ batting_hbp_bi : num 0 0 0 0 0 0 0 0 0 0 ...
$ batting_1B : num 1199 908 973 1044 982 ...
$ total_bases : num 2205 2894 2738 2454 2364 ...
$ total_bases_allowed: num 10627 2800 2527 2238 2177 ...
$ HR_over_OP : num -71 -1 0 -1 0 0 0 -1 0 0 ...
$ walks_over_OP : num -784 -4 0 -3 0 0 0 -3 0 0 ...
$ SO_over_OP : num 4614 7 0 6 0 ...
base_model_all <- lm(target_wins ~ batting_h + batting_2b + batting_3b + batting_bb + batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp_bi + batting_1B + total_bases + total_bases_allowed + HR_over_OP + walks_over_OP + SO_over_OP, data = moneyball_MF)
par(mfrow = c(2,2))
plot(base_model_all)
summary(base_model_all)
Call:
lm(formula = target_wins ~ batting_h + batting_2b + batting_3b +
batting_bb + batting_so + baserun_sb + baserun_cs + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp_bi +
batting_1B + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP, data = moneyball_MF)
Residuals:
Min 1Q Median 3Q Max
-53.274 -8.380 0.104 8.198 52.796
Coefficients: (3 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.9759869 5.6532126 5.479 4.74e-08 ***
batting_h -0.1686270 0.2841790 -0.593 0.5530
batting_2b 0.0538920 0.1441774 0.374 0.7086
batting_3b 0.0105013 0.0764852 0.137 0.8908
batting_bb -0.0633108 0.0709613 -0.892 0.3724
batting_so -0.0147397 0.0025751 -5.724 1.18e-08 ***
baserun_sb -0.0254949 0.0707987 -0.360 0.7188
baserun_cs 0.0232297 0.0156060 1.489 0.1368
pitching_hr 0.0055673 0.0237393 0.235 0.8146
pitching_bb -0.0041737 0.0042028 -0.993 0.3208
pitching_so 0.0015648 0.0008913 1.756 0.0793 .
fielding_e -0.0357029 0.0026097 -13.681 < 2e-16 ***
fielding_dp -0.1209604 0.0140283 -8.623 < 2e-16 ***
batting_hbp_bi -8.3439489 4.3289726 -1.927 0.0540 .
batting_1B 0.1373699 0.2138647 0.642 0.5207
total_bases 0.0748094 0.0707005 1.058 0.2901
total_bases_allowed 0.0005650 0.0003710 1.523 0.1279
HR_over_OP NA NA NA NA
walks_over_OP NA NA NA NA
SO_over_OP NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.62 on 2259 degrees of freedom
Multiple R-squared: 0.3626, Adjusted R-squared: 0.3581
F-statistic: 80.32 on 16 and 2259 DF, p-value: < 2.2e-16
mse <- function(sm)
mean(sm$residuals^2)
paste('MSE equal ', mse(base_model_all), "and RMSE is ", sqrt(mse(base_model_all)))
[1] "MSE equal 158.085424102926 and RMSE is 12.5732026191789"
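Though R-squared and adjusted R-squared are decent, we can clearly see that this model is not optimal: three of the new variables (HR_over_OP, walks_over_OP, and SO_over_OP) have no coefficient estimates because of singularities. Let’s try to forget about the new additions, and build a model without them.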
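The "not defined because of singularities" note in the summary above appears when a predictor is an exact linear combination of other predictors in the model, as happens with a difference of two columns that are both included. A small sketch on made-up data shows the effect; the `toy` data frame and its column names are hypothetical.

```r
set.seed(7)
toy <- data.frame(a = rnorm(30), b = rnorm(30))
toy$a_minus_b <- toy$a - toy$b                      # exact linear combination
toy$y <- 1 + 2 * toy$a - toy$b + rnorm(30)

fit <- lm(y ~ a + b + a_minus_b, data = toy)

coef(fit)    # the coefficient for a_minus_b is NA
alias(fit)   # reports which terms are linearly dependent on the others
```

Running alias() on a fitted model is a quick way to find out which derived variables carry no information beyond the columns they were built from.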
Let’s fix the issue with outliers and see if we get any improvements. For the first approach we will use Winsorizing.
Every value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, the usual cutoffs for outliers, is replaced with the 5th or the 95th percentile of its column, respectively.
# Set aside index (column 1) and batting_hbp_bi (column 15) before treating outliers
outlier_treat <- moneyball_MF[, -c(1, 15)]
comp_data <- moneyball_MF[, -c(1, 15)]
par(mfrow = c(1, 2))
for (i in seq_along(outlier_treat)) {
col_name <- colnames(outlier_treat)[i]
qnt <- quantile(outlier_treat[, i], probs = c(.25, .75), na.rm = TRUE)
caps <- quantile(outlier_treat[, i], probs = c(.05, .95), na.rm = TRUE)
H <- 1.5 * IQR(outlier_treat[, i], na.rm = TRUE)
# Replace values beyond the 1.5 * IQR fences with the 5th / 95th percentile
outlier_treat[, i][outlier_treat[, i] < (qnt[1] - H)] <- caps[1]
outlier_treat[, i][outlier_treat[, i] > (qnt[2] + H)] <- caps[2]
# Predictor on the x-axis, target on the y-axis: treated data first, then the original
plot(outlier_treat[, i], outlier_treat$target_wins, xlab = col_name, ylab = "Target Wins", main = paste("Treated Scatter Plot of", col_name))
plot(comp_data[, i], comp_data$target_wins, xlab = col_name, ylab = "Target Wins", main = paste("Scatter Plot of", col_name))
}
# Add back the columns that we dropped prior to the outlier treatment
outlier_treat <- cbind(outlier_treat, moneyball_MF[, c(1, 15)])
Let’s try different models using the new data.
base_model_orig <-
lm(target_wins ~ batting_h + batting_2b + batting_3b + batting_bb + batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp_bi + batting_1B + total_bases + total_bases_allowed + HR_over_OP + walks_over_OP + SO_over_OP, data = outlier_treat)
par(mfrow = c(2, 2))
plot(base_model_orig)
summary(base_model_orig)
Call:
lm(formula = target_wins ~ batting_h + batting_2b + batting_3b +
batting_bb + batting_so + baserun_sb + baserun_cs + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp_bi +
batting_1B + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP, data = outlier_treat)
Residuals:
Min 1Q Median 3Q Max
-43.484 -8.131 0.185 7.880 56.334
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.505321 6.161063 5.763 9.40e-09 ***
batting_h 0.004682 0.013668 0.343 0.73195
batting_2b -0.039914 0.014587 -2.736 0.00626 **
batting_3b 0.045395 0.022547 2.013 0.04420 *
batting_bb 0.025004 0.008987 2.782 0.00544 **
batting_so -0.012517 0.005356 -2.337 0.01952 *
baserun_sb 0.058822 0.008732 6.736 2.06e-11 ***
baserun_cs 0.004441 0.017225 0.258 0.79658
pitching_hr -0.019794 0.018310 -1.081 0.27977
pitching_bb -0.035758 0.007633 -4.684 2.97e-06 ***
pitching_so -0.004950 0.004976 -0.995 0.31992
fielding_e -0.043141 0.003086 -13.981 < 2e-16 ***
fielding_dp -0.102315 0.013302 -7.691 2.16e-14 ***
batting_hbp_bi -4.593192 1.129644 -4.066 4.95e-05 ***
batting_1B -0.011594 0.012932 -0.896 0.37009
total_bases 0.024300 0.004496 5.404 7.20e-08 ***
total_bases_allowed 0.011993 0.002032 5.903 4.11e-09 ***
HR_over_OP -0.017831 0.081821 -0.218 0.82750
walks_over_OP 0.031357 0.010865 2.886 0.00394 **
SO_over_OP 0.010300 0.004514 2.282 0.02260 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.98 on 2256 degrees of freedom
Multiple R-squared: 0.3481, Adjusted R-squared: 0.3427
F-statistic: 63.42 on 19 and 2256 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(base_model_orig))
[1] "MSE equal 142.276852725476"
This model looks good from a performance point of view (R-squared), but the variance of the residuals does not inspire confidence, especially after analyzing the Cook’s distance graph: several observations are far away from the rest. Let’s build another model including only the variables with low p-values.
base_model_lp <-
lm(target_wins ~ batting_2b + batting_3b + batting_bb + batting_so + baserun_sb + pitching_bb +
fielding_e + fielding_dp + batting_hbp_bi + total_bases + total_bases_allowed + walks_over_OP + SO_over_OP, data = outlier_treat)
par(mfrow = c(2, 2))
plot(base_model_lp)
summary(base_model_lp)
Call:
lm(formula = target_wins ~ batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + pitching_bb + fielding_e + fielding_dp +
batting_hbp_bi + total_bases + total_bases_allowed + walks_over_OP +
SO_over_OP, data = outlier_treat)
Residuals:
Min 1Q Median 3Q Max
-42.901 -8.097 0.161 7.936 59.030
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.731882 3.292267 9.942 < 2e-16 ***
batting_2b -0.030454 0.009036 -3.370 0.000764 ***
batting_3b 0.056393 0.017717 3.183 0.001477 **
batting_bb 0.029599 0.008439 3.507 0.000461 ***
batting_so -0.016779 0.001805 -9.298 < 2e-16 ***
baserun_sb 0.061590 0.005731 10.747 < 2e-16 ***
pitching_bb -0.036331 0.007421 -4.896 1.05e-06 ***
fielding_e -0.042373 0.002943 -14.395 < 2e-16 ***
fielding_dp -0.104205 0.012926 -8.062 1.21e-15 ***
batting_hbp_bi -4.331761 1.088969 -3.978 7.17e-05 ***
total_bases 0.021730 0.002414 9.000 < 2e-16 ***
total_bases_allowed 0.010745 0.001491 7.206 7.80e-13 ***
walks_over_OP 0.030961 0.010464 2.959 0.003120 **
SO_over_OP 0.008720 0.003918 2.225 0.026148 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.97 on 2262 degrees of freedom
Multiple R-squared: 0.3473, Adjusted R-squared: 0.3435
F-statistic: 92.57 on 13 and 2262 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(base_model_lp))
[1] "MSE equal 142.46925592967"
Though the R-squared value went down slightly, there are some improvements on the Cook’s distance chart. Now let’s use the caret package to apply the transformations we discussed earlier in our exploration phase. I will include all the variables except the ones that cause multicollinearity issues.
trans <- preProcess(outlier_treat, method = c("BoxCox","center", "scale"))
transformed <- predict(trans, outlier_treat)
head(transformed)
target_wins batting_h batting_2b batting_3b batting_bb batting_so baserun_sb
1 -1.76727657 -0.1111056 -1.0260875 -0.5976418 -2.0900610 0.4519711 2.1194831
2 -0.76118502 -1.0523406 -0.4551888 -1.2453898 1.8582866 1.4076716 -1.1160520
3 0.31687312 -0.7066133 -0.1661113 -0.7500531 0.9112549 0.7596000 -1.0079124
4 -0.76118502 -0.6172179 -0.6810714 -0.6357446 -0.5850077 0.7801086 -1.0439589
5 0.04120325 -1.4460647 -1.2133573 -1.0548757 -0.3951198 0.7719052 -0.9718659
6 -0.43149374 -1.6187221 -0.8871532 -0.7119502 -0.6557485 0.9892963 -0.2749666
baserun_cs pitching_hr pitching_bb pitching_so fielding_e fielding_dp batting_1B
1 0.7645702 -0.35375893 1.8815144 1.6520725 1.73665912 -1.0567187 1.2688903
2 -1.1522098 1.40662888 1.3308308 1.2672943 0.25915028 0.4643822 -1.7915976
3 -1.1795924 0.51820887 0.5763990 0.5375424 0.06267378 0.3888116 -0.9158042
4 -1.0974447 -0.13988003 -0.8749743 0.5861926 -0.07748718 0.5022409 -0.1066855
5 -0.8510015 -0.05761892 -0.6845719 0.5508107 -0.49255669 0.9602644 -0.8054425
6 -0.3033501 -0.22214115 -0.9935698 0.7852158 -0.80732240 0.2382644 -1.1961690
total_bases total_bases_allowed HR_over_OP walks_over_OP SO_over_OP index
1 -1.89010226 2.11040379 -2.9180993 -2.2942806 2.5391943 -2.185553
2 0.45396364 0.59390081 0.5315340 0.6102467 -0.5068186 -2.175875
3 -0.07698403 -0.05571369 0.6815181 0.6689240 -0.5617416 -2.167613
4 -1.04358107 -0.88758315 0.5315340 0.6249160 -0.5146647 -2.160153
5 -1.34989704 -1.08694911 0.6815181 0.6689240 -0.5617416 -2.153239
6 -1.30565140 -1.38831137 0.6815181 0.6689240 -0.5617416 -2.146731
batting_hbp_bi
1 -0.3025995
2 -0.3025995
3 -0.3025995
4 -0.3025995
5 -0.3025995
6 -0.3025995
trans_model_all <-
lm(target_wins ~ batting_h + batting_2b + batting_3b + batting_bb + batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp_bi + batting_1B + total_bases + total_bases_allowed + HR_over_OP + walks_over_OP + SO_over_OP, data = transformed)
par(mfrow = c(2, 2))
plot(trans_model_all)
summary(trans_model_all)
Call:
lm(formula = target_wins ~ batting_h + batting_2b + batting_3b +
batting_bb + batting_so + baserun_sb + baserun_cs + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp_bi +
batting_1B + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP, data = transformed)
Residuals:
Min 1Q Median 3Q Max
-3.0206 -0.5342 -0.0091 0.5219 4.0756
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.234e-13 1.689e-02 0.000 1.00000
batting_h -5.843e-02 1.012e-01 -0.577 0.56368
batting_2b -4.598e-02 4.152e-02 -1.107 0.26830
batting_3b 1.593e-01 3.933e-02 4.051 5.27e-05 ***
batting_bb 1.826e-01 5.560e-02 3.284 0.00104 **
batting_so -2.285e-01 8.266e-02 -2.765 0.00575 **
baserun_sb 1.581e-01 4.668e-02 3.387 0.00072 ***
baserun_cs 1.716e-01 4.307e-02 3.985 6.97e-05 ***
pitching_hr -1.039e-02 7.243e-02 -0.143 0.88597
pitching_bb -1.990e-01 4.685e-02 -4.248 2.24e-05 ***
pitching_so -7.987e-02 7.166e-02 -1.115 0.26515
fielding_e -5.867e-01 3.878e-02 -15.128 < 2e-16 ***
fielding_dp -1.999e-01 2.398e-02 -8.337 < 2e-16 ***
batting_hbp_bi -1.426e-01 2.138e-02 -6.670 3.20e-11 ***
batting_1B 4.419e-02 7.775e-02 0.568 0.56986
total_bases 3.940e-01 8.618e-02 4.572 5.09e-06 ***
total_bases_allowed 2.453e-01 5.800e-02 4.229 2.45e-05 ***
HR_over_OP -6.412e-02 3.611e-02 -1.776 0.07593 .
walks_over_OP 2.012e-01 4.968e-02 4.050 5.29e-05 ***
SO_over_OP 4.417e-02 3.845e-02 1.149 0.25073
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8057 on 2256 degrees of freedom
Multiple R-squared: 0.3563, Adjusted R-squared: 0.3509
F-statistic: 65.72 on 19 and 2256 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(trans_model_all))
[1] "MSE equal 0.643423619819869"
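The `mse()` helper is not defined in this section. A minimal sketch, assuming it is the mean of the squared residuals: that reading is consistent with the value above, since the residual sum of squares for this model is about 1464.4 over 2276 observations, and 1464.4 / 2276 ≈ 0.6434. The name `mse_sketch` and the `mtcars` example are illustrative, not part of the original analysis.

```r
# Hypothetical stand-in for the mse() helper: mean of the squared residuals.
mse_sketch <- function(model) {
  mean(residuals(model)^2)
}

# Illustration on a built-in dataset (not the moneyball data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
mse_value <- mse_sketch(fit)

# The same quantity computed from the residual sum of squares.
rss <- sum(residuals(fit)^2)
print(c(mse = mse_value, rss_over_n = rss / nrow(mtcars)))
```

Note that this divides by the number of observations, not by the residual degrees of freedom, so it is slightly smaller than the squared residual standard error reported by `summary()`.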
The residual plots look pretty good, with the exception of a few possibly influential observations. The Cook’s distance contours in the Residuals vs Leverage panel make it clear that some observations are influential, but the other panels look as expected.
Let’s fit another model on the same transformed data, this time keeping only the columns with low p-values.
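Influential observations can also be listed numerically instead of being read off the chart. A minimal sketch on simulated data, assuming the common rule-of-thumb cutoff of 4/n (the cutoff and the simulated data are assumptions, not something the analysis above uses):

```r
# Simulated data with one planted influential point (illustrative only).
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
x[n] <- 8     # high leverage
y[n] <- -20   # far from the true line
fit <- lm(y ~ x)

# Cook's distance for every observation.
cooks <- cooks.distance(fit)

# Flag observations above the assumed 4/n rule-of-thumb cutoff.
threshold <- 4 / n
flagged <- which(cooks > threshold)

# Largest distances first.
print(head(sort(cooks, decreasing = TRUE), 5))
```

The same call, `cooks.distance(trans_model_all)`, would give the values behind the fourth diagnostic panel.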
trans_model_lp <-
lm(target_wins ~ batting_3b + batting_bb + batting_so + baserun_sb + baserun_cs + pitching_bb + fielding_e + fielding_dp + batting_hbp_bi + total_bases + total_bases_allowed + walks_over_OP + SO_over_OP, data = transformed)
par(mfrow = c(2, 2))
plot(trans_model_lp)
summary(trans_model_lp)
Call:
lm(formula = target_wins ~ batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_bb + fielding_e + fielding_dp +
batting_hbp_bi + total_bases + total_bases_allowed + walks_over_OP +
SO_over_OP, data = transformed)
Residuals:
Min 1Q Median 3Q Max
-2.9477 -0.5553 -0.0108 0.5294 3.8245
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.742e-15 1.690e-02 0.000 1.000000
batting_3b 1.700e-01 3.257e-02 5.220 1.95e-07 ***
batting_bb 2.118e-01 5.038e-02 4.203 2.73e-05 ***
batting_so -3.135e-01 3.059e-02 -10.248 < 2e-16 ***
baserun_sb 1.697e-01 3.965e-02 4.279 1.96e-05 ***
baserun_cs 1.621e-01 4.220e-02 3.842 0.000125 ***
pitching_bb -1.980e-01 4.593e-02 -4.310 1.70e-05 ***
fielding_e -5.777e-01 3.794e-02 -15.227 < 2e-16 ***
fielding_dp -1.972e-01 2.371e-02 -8.319 < 2e-16 ***
batting_hbp_bi -1.565e-01 2.020e-02 -7.747 1.41e-14 ***
total_bases 3.385e-01 4.058e-02 8.343 < 2e-16 ***
total_bases_allowed 2.243e-01 4.222e-02 5.313 1.19e-07 ***
walks_over_OP 1.793e-01 4.772e-02 3.756 0.000177 ***
SO_over_OP 5.578e-02 3.351e-02 1.665 0.096147 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8064 on 2262 degrees of freedom
Multiple R-squared: 0.3534, Adjusted R-squared: 0.3497
F-statistic: 95.12 on 13 and 2262 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(trans_model_lp))
[1] "MSE equal 0.646268171060538"
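The low p-value columns were picked by hand from the summary table. A sketch of how the same filter could be automated, assuming a 0.05 cutoff and numeric predictors only (so that coefficient names match term names); the simulated data and all object names here are illustrative:

```r
# Simulated data: x1 and x2 drive y, x3 is pure noise (illustrative only).
set.seed(7)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 3 * sim$x1 - 2 * sim$x2 + rnorm(n)

fit_full <- lm(y ~ x1 + x2 + x3, data = sim)

# Pull the p-value column out of the coefficient table.
p_values <- summary(fit_full)$coefficients[, "Pr(>|t|)"]

# Keep predictors below the assumed cutoff, ignoring the intercept.
alpha <- 0.05
keep <- setdiff(names(p_values)[p_values < alpha], "(Intercept)")
if (length(keep) == 0) keep <- "1"  # fall back to an intercept-only model

# Rebuild the formula and refit on the reduced set of columns.
reduced_formula <- reformulate(keep, response = "y")
fit_reduced <- lm(reduced_formula, data = sim)
print(reduced_formula)
```

This is a heuristic: p-values shift once correlated columns are dropped, which is one reason to compare the result with the stepwise runs.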
This model seems to be on par with the other models. I’ll try stepwise approaches next, and then see whether removing “influential” observations improves the model. Let’s start with the stepwise approach.
1. Both directions
stepwise_base_model_bd <- stepAIC(trans_model_all, direction = "both")
Start: AIC=-963.61
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb +
pitching_so + fielding_e + fielding_dp + batting_hbp_bi +
batting_1B + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP
Df Sum of Sq RSS AIC
- pitching_hr 1 0.013 1464.5 -965.59
- batting_1B 1 0.210 1464.6 -965.28
- batting_h 1 0.216 1464.7 -965.27
- batting_2b 1 0.796 1465.2 -964.37
- pitching_so 1 0.806 1465.2 -964.35
- SO_over_OP 1 0.857 1465.3 -964.28
<none> 1464.4 -963.61
- HR_over_OP 1 2.047 1466.5 -962.43
- batting_so 1 4.961 1469.4 -957.91
- batting_bb 1 6.999 1471.4 -954.75
- baserun_sb 1 7.445 1471.9 -954.07
- baserun_cs 1 10.307 1474.7 -949.64
- walks_over_OP 1 10.649 1475.1 -949.12
- batting_3b 1 10.652 1475.1 -949.11
- total_bases_allowed 1 11.607 1476.0 -947.64
- pitching_bb 1 11.715 1476.2 -947.47
- total_bases 1 13.570 1478.0 -944.61
- batting_hbp_bi 1 28.880 1493.3 -921.16
- fielding_dp 1 45.121 1509.5 -896.54
- fielding_e 1 148.562 1613.0 -745.69
Step: AIC=-965.59
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_bb + pitching_so +
fielding_e + fielding_dp + batting_hbp_bi + batting_1B +
total_bases + total_bases_allowed + HR_over_OP + walks_over_OP +
SO_over_OP
Df Sum of Sq RSS AIC
- batting_h 1 0.231 1464.7 -967.23
- batting_1B 1 0.283 1464.7 -967.15
- pitching_so 1 0.823 1465.3 -966.31
- SO_over_OP 1 0.857 1465.3 -966.25
- batting_2b 1 0.942 1465.4 -966.12
<none> 1464.5 -965.59
- HR_over_OP 1 2.294 1466.7 -964.02
+ pitching_hr 1 0.013 1464.4 -963.61
- batting_so 1 5.041 1469.5 -959.77
- batting_bb 1 7.884 1472.3 -955.37
- baserun_sb 1 9.564 1474.0 -952.77
- baserun_cs 1 10.327 1474.8 -951.59
- walks_over_OP 1 10.755 1475.2 -950.93
- total_bases_allowed 1 11.689 1476.1 -949.49
- pitching_bb 1 11.702 1476.2 -949.47
- batting_3b 1 13.954 1478.4 -946.00
- total_bases 1 23.444 1487.9 -931.44
- batting_hbp_bi 1 29.915 1494.4 -921.56
- fielding_dp 1 45.495 1509.9 -897.95
- fielding_e 1 153.163 1617.6 -741.19
Step: AIC=-967.23
target_wins ~ batting_2b + batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_bb + pitching_so + fielding_e +
fielding_dp + batting_hbp_bi + batting_1B + total_bases +
total_bases_allowed + HR_over_OP + walks_over_OP + SO_over_OP
Df Sum of Sq RSS AIC
- batting_1B 1 0.052 1464.7 -969.15
- pitching_so 1 0.727 1465.4 -968.10
- SO_over_OP 1 0.896 1465.6 -967.84
<none> 1464.7 -967.23
+ batting_h 1 0.231 1464.5 -965.59
- HR_over_OP 1 2.382 1467.1 -965.53
+ pitching_hr 1 0.028 1464.7 -965.27
- batting_2b 1 2.748 1467.4 -964.96
- batting_so 1 5.501 1470.2 -960.70
- batting_bb 1 8.694 1473.4 -955.76
- baserun_cs 1 10.184 1474.9 -953.46
- walks_over_OP 1 10.525 1475.2 -952.93
- baserun_sb 1 11.343 1476.0 -951.67
- pitching_bb 1 11.475 1476.2 -951.47
- total_bases_allowed 1 13.336 1478.0 -948.60
- batting_3b 1 14.284 1479.0 -947.14
- total_bases 1 29.252 1493.9 -924.22
- batting_hbp_bi 1 30.032 1494.7 -923.03
- fielding_dp 1 45.333 1510.0 -899.85
- fielding_e 1 152.953 1617.6 -743.16
Step: AIC=-969.15
target_wins ~ batting_2b + batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_bb + pitching_so + fielding_e +
fielding_dp + batting_hbp_bi + total_bases + total_bases_allowed +
HR_over_OP + walks_over_OP + SO_over_OP
Df Sum of Sq RSS AIC
- pitching_so 1 0.748 1465.5 -969.99
- SO_over_OP 1 0.938 1465.7 -969.69
<none> 1464.7 -969.15
- HR_over_OP 1 2.345 1467.1 -967.51
+ pitching_hr 1 0.069 1464.7 -967.25
+ batting_1B 1 0.052 1464.7 -967.23
+ batting_h 1 0.000 1464.7 -967.15
- batting_2b 1 2.766 1467.5 -966.85
- batting_so 1 6.047 1470.8 -961.77
- batting_bb 1 8.644 1473.4 -957.75
- walks_over_OP 1 10.533 1475.3 -954.84
- baserun_cs 1 10.534 1475.3 -954.84
- baserun_sb 1 11.313 1476.0 -953.63
- pitching_bb 1 11.595 1476.3 -953.20
- total_bases_allowed 1 13.385 1478.1 -950.44
- batting_3b 1 14.301 1479.0 -949.03
- batting_hbp_bi 1 31.111 1495.8 -923.31
- total_bases 1 32.130 1496.9 -921.76
- fielding_dp 1 45.459 1510.2 -901.58
- fielding_e 1 152.916 1617.6 -745.14
Step: AIC=-969.99
target_wins ~ batting_2b + batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_bb + fielding_e + fielding_dp +
batting_hbp_bi + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP
Df Sum of Sq RSS AIC
- SO_over_OP 1 0.549 1466.0 -971.13
<none> 1465.5 -969.99
+ pitching_so 1 0.748 1464.7 -969.15
- HR_over_OP 1 2.306 1467.8 -968.41
+ pitching_hr 1 0.103 1465.4 -968.15
+ batting_1B 1 0.073 1465.4 -968.10
+ batting_h 1 0.008 1465.5 -968.00
- batting_2b 1 3.047 1468.5 -967.26
- batting_bb 1 9.387 1474.9 -957.45
- baserun_cs 1 10.220 1475.7 -956.17
- walks_over_OP 1 10.380 1475.9 -955.92
- baserun_sb 1 10.865 1476.3 -955.17
- pitching_bb 1 12.250 1477.7 -953.04
- total_bases_allowed 1 14.016 1479.5 -950.32
- batting_3b 1 14.044 1479.5 -950.28
- batting_hbp_bi 1 31.989 1497.5 -922.84
- total_bases 1 43.128 1508.6 -905.97
- fielding_dp 1 44.849 1510.3 -903.38
- batting_so 1 70.292 1535.8 -865.35
- fielding_e 1 152.404 1617.9 -746.81
Step: AIC=-971.13
target_wins ~ batting_2b + batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_bb + fielding_e + fielding_dp +
batting_hbp_bi + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP
Df Sum of Sq RSS AIC
<none> 1466.0 -971.13
+ SO_over_OP 1 0.549 1465.5 -969.99
+ pitching_so 1 0.358 1465.7 -969.69
+ pitching_hr 1 0.117 1465.9 -969.32
+ batting_1B 1 0.103 1465.9 -969.29
+ batting_h 1 0.011 1466.0 -969.15
- batting_2b 1 2.894 1468.9 -968.65
- HR_over_OP 1 3.868 1469.9 -967.14
- batting_bb 1 9.058 1475.1 -959.11
- walks_over_OP 1 10.010 1476.0 -957.65
- baserun_sb 1 10.805 1476.8 -956.42
- baserun_cs 1 11.578 1477.6 -955.23
- pitching_bb 1 12.014 1478.0 -954.56
- batting_3b 1 13.983 1480.0 -951.53
- total_bases_allowed 1 14.807 1480.8 -950.26
- batting_hbp_bi 1 32.068 1498.1 -923.88
- total_bases 1 42.729 1508.8 -907.74
- fielding_dp 1 44.892 1510.9 -904.48
- batting_so 1 73.739 1539.8 -861.44
- fielding_e 1 152.263 1618.3 -748.23
par(mfrow = c(2, 2))
plot(stepwise_base_model_bd)
summary(stepwise_base_model_bd)
Call:
lm(formula = target_wins ~ batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_bb + fielding_e +
fielding_dp + batting_hbp_bi + total_bases + total_bases_allowed +
HR_over_OP + walks_over_OP, data = transformed)
Residuals:
Min 1Q Median 3Q Max
-3.0849 -0.5413 -0.0115 0.5286 4.1994
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.053e-15 1.688e-02 0.000 1.00000
batting_2b -5.759e-02 2.726e-02 -2.113 0.03475 *
batting_3b 1.535e-01 3.305e-02 4.644 3.62e-06 ***
batting_bb 1.918e-01 5.131e-02 3.738 0.00019 ***
batting_so -3.117e-01 2.923e-02 -10.664 < 2e-16 ***
baserun_sb 1.633e-01 4.000e-02 4.082 4.62e-05 ***
baserun_cs 1.757e-01 4.157e-02 4.226 2.48e-05 ***
pitching_bb -1.975e-01 4.589e-02 -4.305 1.74e-05 ***
fielding_e -5.833e-01 3.806e-02 -15.324 < 2e-16 ***
fielding_dp -1.973e-01 2.371e-02 -8.321 < 2e-16 ***
batting_hbp_bi -1.449e-01 2.061e-02 -7.033 2.68e-12 ***
total_bases 3.846e-01 4.738e-02 8.118 7.71e-16 ***
total_bases_allowed 2.045e-01 4.280e-02 4.779 1.88e-06 ***
HR_over_OP -7.377e-02 3.020e-02 -2.442 0.01467 *
walks_over_OP 1.786e-01 4.545e-02 3.929 8.78e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8052 on 2261 degrees of freedom
Multiple R-squared: 0.3556, Adjusted R-squared: 0.3516
F-statistic: 89.12 on 14 and 2261 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(stepwise_base_model_bd))
[1] "MSE equal 0.644123273462486"
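The AIC values in the trace come from `extractAIC()`, which for a linear model is n * log(RSS / n) + 2 * (number of coefficients). For the starting model this gives roughly 2276 * log(1464.4 / 2276) + 2 * 20 ≈ -963.6, matching the first line of the trace. A small sketch on `mtcars` (an illustrative dataset, not the moneyball data) that reproduces the calculation by hand:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)
rss <- sum(residuals(fit)^2)
edf <- n - df.residual(fit)  # number of estimated coefficients

# AIC as used in the stepwise trace: n * log(RSS / n) + 2 * edf.
aic_by_hand <- n * log(rss / n) + 2 * edf

# The value computed by the stats package.
aic_step <- extractAIC(fit)[2]

print(c(by_hand = aic_by_hand, extractAIC = aic_step))
```

Because every candidate is fitted to the same rows, only differences in this number matter, and a lower value is preferred.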
2. Forward direction
stepwise_base_model_fw <- stepAIC(trans_model_all, direction = "forward")
Start: AIC=-963.61
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb +
pitching_so + fielding_e + fielding_dp + batting_hbp_bi +
batting_1B + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP
par(mfrow = c(2, 2))
plot(stepwise_base_model_fw)
summary(stepwise_base_model_fw)
Call:
lm(formula = target_wins ~ batting_h + batting_2b + batting_3b +
batting_bb + batting_so + baserun_sb + baserun_cs + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp_bi +
batting_1B + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP, data = transformed)
Residuals:
Min 1Q Median 3Q Max
-3.0206 -0.5342 -0.0091 0.5219 4.0756
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.234e-13 1.689e-02 0.000 1.00000
batting_h -5.843e-02 1.012e-01 -0.577 0.56368
batting_2b -4.598e-02 4.152e-02 -1.107 0.26830
batting_3b 1.593e-01 3.933e-02 4.051 5.27e-05 ***
batting_bb 1.826e-01 5.560e-02 3.284 0.00104 **
batting_so -2.285e-01 8.266e-02 -2.765 0.00575 **
baserun_sb 1.581e-01 4.668e-02 3.387 0.00072 ***
baserun_cs 1.716e-01 4.307e-02 3.985 6.97e-05 ***
pitching_hr -1.039e-02 7.243e-02 -0.143 0.88597
pitching_bb -1.990e-01 4.685e-02 -4.248 2.24e-05 ***
pitching_so -7.987e-02 7.166e-02 -1.115 0.26515
fielding_e -5.867e-01 3.878e-02 -15.128 < 2e-16 ***
fielding_dp -1.999e-01 2.398e-02 -8.337 < 2e-16 ***
batting_hbp_bi -1.426e-01 2.138e-02 -6.670 3.20e-11 ***
batting_1B 4.419e-02 7.775e-02 0.568 0.56986
total_bases 3.940e-01 8.618e-02 4.572 5.09e-06 ***
total_bases_allowed 2.453e-01 5.800e-02 4.229 2.45e-05 ***
HR_over_OP -6.412e-02 3.611e-02 -1.776 0.07593 .
walks_over_OP 2.012e-01 4.968e-02 4.050 5.29e-05 ***
SO_over_OP 4.417e-02 3.845e-02 1.149 0.25073
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8057 on 2256 degrees of freedom
Multiple R-squared: 0.3563, Adjusted R-squared: 0.3509
F-statistic: 65.72 on 19 and 2256 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(stepwise_base_model_fw))
[1] "MSE equal 0.643423619819869"
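The forward run above starts from the full model without a `scope`, so there is nothing left to add: the trace stops at the starting AIC of -963.61 and the summary is identical to `trans_model_all`. A sketch of a forward search that starts from an intercept-only model and is given an upper scope, shown on simulated data (the data and object names are illustrative assumptions):

```r
library(MASS)

# Simulated data: x1 and x2 drive y, x3 is pure noise (illustrative only).
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 3 * sim$x1 - 2 * sim$x2 + rnorm(n)

# Start from the intercept-only model.
null_model <- lm(y ~ 1, data = sim)

# Tell stepAIC which terms it is allowed to add.
forward_fit <- stepAIC(
  null_model,
  scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
  direction = "forward",
  trace = FALSE
)

print(formula(forward_fit))
```

For the moneyball models, the upper scope would be the right-hand side of the `trans_model_all` formula.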
3. Backward direction
stepwise_base_model_bw <- stepAIC(trans_model_all, direction = "backward")
Start: AIC=-963.61
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb +
pitching_so + fielding_e + fielding_dp + batting_hbp_bi +
batting_1B + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP
Df Sum of Sq RSS AIC
- pitching_hr 1 0.013 1464.5 -965.59
- batting_1B 1 0.210 1464.6 -965.28
- batting_h 1 0.216 1464.7 -965.27
- batting_2b 1 0.796 1465.2 -964.37
- pitching_so 1 0.806 1465.2 -964.35
- SO_over_OP 1 0.857 1465.3 -964.28
<none> 1464.4 -963.61
- HR_over_OP 1 2.047 1466.5 -962.43
- batting_so 1 4.961 1469.4 -957.91
- batting_bb 1 6.999 1471.4 -954.75
- baserun_sb 1 7.445 1471.9 -954.07
- baserun_cs 1 10.307 1474.7 -949.64
- walks_over_OP 1 10.649 1475.1 -949.12
- batting_3b 1 10.652 1475.1 -949.11
- total_bases_allowed 1 11.607 1476.0 -947.64
- pitching_bb 1 11.715 1476.2 -947.47
- total_bases 1 13.570 1478.0 -944.61
- batting_hbp_bi 1 28.880 1493.3 -921.16
- fielding_dp 1 45.121 1509.5 -896.54
- fielding_e 1 148.562 1613.0 -745.69
Step: AIC=-965.59
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_bb + pitching_so +
fielding_e + fielding_dp + batting_hbp_bi + batting_1B +
total_bases + total_bases_allowed + HR_over_OP + walks_over_OP +
SO_over_OP
Df Sum of Sq RSS AIC
- batting_h 1 0.231 1464.7 -967.23
- batting_1B 1 0.283 1464.7 -967.15
- pitching_so 1 0.823 1465.3 -966.31
- SO_over_OP 1 0.857 1465.3 -966.25
- batting_2b 1 0.942 1465.4 -966.12
<none> 1464.5 -965.59
- HR_over_OP 1 2.294 1466.7 -964.02
- batting_so 1 5.041 1469.5 -959.77
- batting_bb 1 7.884 1472.3 -955.37
- baserun_sb 1 9.564 1474.0 -952.77
- baserun_cs 1 10.327 1474.8 -951.59
- walks_over_OP 1 10.755 1475.2 -950.93
- total_bases_allowed 1 11.689 1476.1 -949.49
- pitching_bb 1 11.702 1476.2 -949.47
- batting_3b 1 13.954 1478.4 -946.00
- total_bases 1 23.444 1487.9 -931.44
- batting_hbp_bi 1 29.915 1494.4 -921.56
- fielding_dp 1 45.495 1509.9 -897.95
- fielding_e 1 153.163 1617.6 -741.19
Step: AIC=-967.23
target_wins ~ batting_2b + batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_bb + pitching_so + fielding_e +
fielding_dp + batting_hbp_bi + batting_1B + total_bases +
total_bases_allowed + HR_over_OP + walks_over_OP + SO_over_OP
Df Sum of Sq RSS AIC
- batting_1B 1 0.052 1464.7 -969.15
- pitching_so 1 0.727 1465.4 -968.10
- SO_over_OP 1 0.896 1465.6 -967.84
<none> 1464.7 -967.23
- HR_over_OP 1 2.382 1467.1 -965.53
- batting_2b 1 2.748 1467.4 -964.96
- batting_so 1 5.501 1470.2 -960.70
- batting_bb 1 8.694 1473.4 -955.76
- baserun_cs 1 10.184 1474.9 -953.46
- walks_over_OP 1 10.525 1475.2 -952.93
- baserun_sb 1 11.343 1476.0 -951.67
- pitching_bb 1 11.475 1476.2 -951.47
- total_bases_allowed 1 13.336 1478.0 -948.60
- batting_3b 1 14.284 1479.0 -947.14
- total_bases 1 29.252 1493.9 -924.22
- batting_hbp_bi 1 30.032 1494.7 -923.03
- fielding_dp 1 45.333 1510.0 -899.85
- fielding_e 1 152.953 1617.6 -743.16
Step: AIC=-969.15
target_wins ~ batting_2b + batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_bb + pitching_so + fielding_e +
fielding_dp + batting_hbp_bi + total_bases + total_bases_allowed +
HR_over_OP + walks_over_OP + SO_over_OP
Df Sum of Sq RSS AIC
- pitching_so 1 0.748 1465.5 -969.99
- SO_over_OP 1 0.938 1465.7 -969.69
<none> 1464.7 -969.15
- HR_over_OP 1 2.345 1467.1 -967.51
- batting_2b 1 2.766 1467.5 -966.85
- batting_so 1 6.047 1470.8 -961.77
- batting_bb 1 8.644 1473.4 -957.75
- walks_over_OP 1 10.533 1475.3 -954.84
- baserun_cs 1 10.534 1475.3 -954.84
- baserun_sb 1 11.313 1476.0 -953.63
- pitching_bb 1 11.595 1476.3 -953.20
- total_bases_allowed 1 13.385 1478.1 -950.44
- batting_3b 1 14.301 1479.0 -949.03
- batting_hbp_bi 1 31.111 1495.8 -923.31
- total_bases 1 32.130 1496.9 -921.76
- fielding_dp 1 45.459 1510.2 -901.58
- fielding_e 1 152.916 1617.6 -745.14
Step: AIC=-969.99
target_wins ~ batting_2b + batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_bb + fielding_e + fielding_dp +
batting_hbp_bi + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP
Df Sum of Sq RSS AIC
- SO_over_OP 1 0.549 1466.0 -971.13
<none> 1465.5 -969.99
- HR_over_OP 1 2.306 1467.8 -968.41
- batting_2b 1 3.047 1468.5 -967.26
- batting_bb 1 9.387 1474.9 -957.45
- baserun_cs 1 10.220 1475.7 -956.17
- walks_over_OP 1 10.380 1475.9 -955.92
- baserun_sb 1 10.865 1476.3 -955.17
- pitching_bb 1 12.250 1477.7 -953.04
- total_bases_allowed 1 14.016 1479.5 -950.32
- batting_3b 1 14.044 1479.5 -950.28
- batting_hbp_bi 1 31.989 1497.5 -922.84
- total_bases 1 43.128 1508.6 -905.97
- fielding_dp 1 44.849 1510.3 -903.38
- batting_so 1 70.292 1535.8 -865.35
- fielding_e 1 152.404 1617.9 -746.81
Step: AIC=-971.13
target_wins ~ batting_2b + batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_bb + fielding_e + fielding_dp +
batting_hbp_bi + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP
Df Sum of Sq RSS AIC
<none> 1466.0 -971.13
- batting_2b 1 2.894 1468.9 -968.65
- HR_over_OP 1 3.868 1469.9 -967.14
- batting_bb 1 9.058 1475.1 -959.11
- walks_over_OP 1 10.010 1476.0 -957.65
- baserun_sb 1 10.805 1476.8 -956.42
- baserun_cs 1 11.578 1477.6 -955.23
- pitching_bb 1 12.014 1478.0 -954.56
- batting_3b 1 13.983 1480.0 -951.53
- total_bases_allowed 1 14.807 1480.8 -950.26
- batting_hbp_bi 1 32.068 1498.1 -923.88
- total_bases 1 42.729 1508.8 -907.74
- fielding_dp 1 44.892 1510.9 -904.48
- batting_so 1 73.739 1539.8 -861.44
- fielding_e 1 152.263 1618.3 -748.23
par(mfrow = c(2, 2))
plot(stepwise_base_model_bw)
summary(stepwise_base_model_bw)
Call:
lm(formula = target_wins ~ batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_bb + fielding_e +
fielding_dp + batting_hbp_bi + total_bases + total_bases_allowed +
HR_over_OP + walks_over_OP, data = transformed)
Residuals:
Min 1Q Median 3Q Max
-3.0849 -0.5413 -0.0115 0.5286 4.1994
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.053e-15 1.688e-02 0.000 1.00000
batting_2b -5.759e-02 2.726e-02 -2.113 0.03475 *
batting_3b 1.535e-01 3.305e-02 4.644 3.62e-06 ***
batting_bb 1.918e-01 5.131e-02 3.738 0.00019 ***
batting_so -3.117e-01 2.923e-02 -10.664 < 2e-16 ***
baserun_sb 1.633e-01 4.000e-02 4.082 4.62e-05 ***
baserun_cs 1.757e-01 4.157e-02 4.226 2.48e-05 ***
pitching_bb -1.975e-01 4.589e-02 -4.305 1.74e-05 ***
fielding_e -5.833e-01 3.806e-02 -15.324 < 2e-16 ***
fielding_dp -1.973e-01 2.371e-02 -8.321 < 2e-16 ***
batting_hbp_bi -1.449e-01 2.061e-02 -7.033 2.68e-12 ***
total_bases 3.846e-01 4.738e-02 8.118 7.71e-16 ***
total_bases_allowed 2.045e-01 4.280e-02 4.779 1.88e-06 ***
HR_over_OP -7.377e-02 3.020e-02 -2.442 0.01467 *
walks_over_OP 1.786e-01 4.545e-02 3.929 8.78e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8052 on 2261 degrees of freedom
Multiple R-squared: 0.3556, Adjusted R-squared: 0.3516
F-statistic: 89.12 on 14 and 2261 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(stepwise_base_model_bw))
[1] "MSE equal 0.644123273462486"
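The backward search ends on the same 14-term model as the both-direction search, with the same MSE. So far the candidates are very close: adjusted R-squared of 0.3509 (all columns), 0.3497 (low p-value columns) and 0.3516 (stepwise), with MSE of 0.6434, 0.6463 and 0.6441. A sketch of how such a side-by-side table could be assembled, using two `mtcars` models as stand-ins for the fitted objects above:

```r
# Stand-in candidate models (illustrative, not the moneyball fits).
candidates <- list(
  full    = lm(mpg ~ wt + hp + qsec + drat, data = mtcars),
  reduced = lm(mpg ~ wt + hp, data = mtcars)
)

# One row per model: adjusted R-squared, stepwise-style AIC, and MSE.
comparison <- data.frame(
  model  = names(candidates),
  adj_r2 = sapply(candidates, function(m) summary(m)$adj.r.squared),
  aic    = sapply(candidates, function(m) extractAIC(m)[2]),
  mse    = sapply(candidates, function(m) mean(residuals(m)^2)),
  row.names = NULL
)

# Lowest AIC first.
print(comparison[order(comparison$aic), ])
```

In-sample MSE never increases when columns are added to a nested model, so AIC or adjusted R-squared is the fairer column to rank on.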
Let’s remove the influential observations identified in the Cook’s distance chart. We will remove the following observations: 1342, 1810, 1828, 2136, 1820, 2227, 1340, 1811, 2233, 1896, 2020, 2228.
Those observations will be removed from these datasets: transformed and outlier_treat.
# Row positions of the influential observations, defined once and reused
influential_obs <- c(1342, 1810, 1828, 2136, 1820, 2227, 1340, 1811, 2233, 1896, 2020, 2228)
outlier_treat_rm <- outlier_treat[-influential_obs, ]
transformed_rm <- transformed[-influential_obs, ]
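Negative indexing drops rows by position. If rows were filtered out at an earlier step, the row names that label points on diagnostic plots may no longer match positions. Whether that applies here depends on preprocessing not shown in this section, so the following is only a precautionary sketch on a toy data frame (all names are illustrative):

```r
# Toy data frame; dropping row 2 leaves row names 1, 3, 4, 5, 6.
df <- data.frame(x = 1:6, y = c(2, 4, 6, 8, 10, 40))
df <- df[-2, ]

# Label as it would be read off a diagnostic plot (a row name).
ids_to_drop <- c("5")

# Drop by row name: removes the row where x is 5.
by_name <- df[!(rownames(df) %in% ids_to_drop), ]

# Drop by position: removes the fifth remaining row, where x is 6.
by_position <- df[-5, ]

print(by_name)
print(by_position)
```

When the row names are still the default 1 to n, the two approaches give the same result.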
base_model_orig_rm <-
lm(target_wins ~ batting_h + batting_2b + batting_3b + batting_bb + batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp_bi + batting_1B + total_bases + total_bases_allowed + HR_over_OP + walks_over_OP + SO_over_OP, data = outlier_treat_rm)
par(mfrow = c(2, 2))
plot(base_model_orig_rm)
summary(base_model_orig_rm)
Call:
lm(formula = target_wins ~ batting_h + batting_2b + batting_3b +
batting_bb + batting_so + baserun_sb + baserun_cs + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp_bi +
batting_1B + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP, data = outlier_treat_rm)
Residuals:
Min 1Q Median 3Q Max
-36.510 -7.923 0.199 7.737 35.738
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.509e+01 6.062e+00 5.789 8.05e-09 ***
batting_h 2.290e-02 1.366e-02 1.676 0.0938 .
batting_2b -6.824e-02 1.517e-02 -4.498 7.21e-06 ***
batting_3b 3.075e-02 2.279e-02 1.349 0.1773
batting_bb 3.894e-02 9.692e-03 4.018 6.06e-05 ***
batting_so -1.705e-02 5.676e-03 -3.003 0.0027 **
baserun_sb 5.630e-02 8.754e-03 6.431 1.54e-10 ***
baserun_cs 2.526e-04 1.696e-02 0.015 0.9881
pitching_hr -4.925e-02 1.995e-02 -2.469 0.0136 *
pitching_bb -5.042e-02 8.141e-03 -6.194 6.96e-10 ***
pitching_so 8.544e-05 5.253e-03 0.016 0.9870
fielding_e -4.504e-02 3.042e-03 -14.805 < 2e-16 ***
fielding_dp -1.038e-01 1.299e-02 -7.992 2.10e-15 ***
batting_hbp_bi -4.458e+00 1.109e+00 -4.018 6.05e-05 ***
batting_1B -3.192e-02 1.331e-02 -2.399 0.0165 *
total_bases 2.742e-02 4.687e-03 5.850 5.63e-09 ***
total_bases_allowed 1.139e-02 2.093e-03 5.443 5.82e-08 ***
HR_over_OP -1.337e-01 8.175e-02 -1.635 0.1022
walks_over_OP 2.404e-02 1.078e-02 2.229 0.0259 *
SO_over_OP 6.916e-03 4.453e-03 1.553 0.1206
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.68 on 2244 degrees of freedom
Multiple R-squared: 0.3676, Adjusted R-squared: 0.3623
F-statistic: 68.66 on 19 and 2244 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(base_model_orig_rm))
[1] "MSE equal 135.281854794309"
base_model_lp_rm <-
lm(target_wins ~ batting_2b + batting_3b + batting_bb + batting_so + baserun_sb + pitching_bb +
fielding_e + fielding_dp + batting_hbp_bi + total_bases + total_bases_allowed + walks_over_OP + SO_over_OP, data = outlier_treat_rm)
par(mfrow = c(2, 2))
plot(base_model_lp_rm)
summary(base_model_lp_rm)
Call:
lm(formula = target_wins ~ batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + pitching_bb + fielding_e + fielding_dp +
batting_hbp_bi + total_bases + total_bases_allowed + walks_over_OP +
SO_over_OP, data = outlier_treat_rm)
Residuals:
Min 1Q Median 3Q Max
-37.130 -7.902 0.184 7.860 36.231
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.157432 3.232795 9.947 < 2e-16 ***
batting_2b -0.035903 0.008968 -4.003 6.45e-05 ***
batting_3b 0.068862 0.017422 3.953 7.97e-05 ***
batting_bb 0.044466 0.009073 4.901 1.02e-06 ***
batting_so -0.016966 0.001771 -9.578 < 2e-16 ***
baserun_sb 0.060647 0.005628 10.776 < 2e-16 ***
pitching_bb -0.050230 0.008037 -6.250 4.90e-10 ***
fielding_e -0.043364 0.002890 -15.003 < 2e-16 ***
fielding_dp -0.105258 0.012634 -8.331 < 2e-16 ***
batting_hbp_bi -4.089404 1.064805 -3.841 0.000126 ***
total_bases 0.021326 0.002429 8.781 < 2e-16 ***
total_bases_allowed 0.011782 0.001485 7.932 3.37e-15 ***
walks_over_OP 0.023997 0.010402 2.307 0.021150 *
SO_over_OP 0.008021 0.003864 2.076 0.038028 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.69 on 2250 degrees of freedom
Multiple R-squared: 0.3651, Adjusted R-squared: 0.3615
F-statistic: 99.54 on 13 and 2250 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(base_model_lp_rm))
[1] "MSE equal 135.815500400735"
trans_model_all_rm <-
lm(target_wins ~ batting_h + batting_2b + batting_3b + batting_bb + batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp_bi + batting_1B + total_bases + total_bases_allowed + HR_over_OP + walks_over_OP + SO_over_OP, data = transformed_rm)
par(mfrow = c(2, 2))
plot(trans_model_all_rm)
summary(trans_model_all_rm)
Call:
lm(formula = target_wins ~ batting_h + batting_2b + batting_3b +
batting_bb + batting_so + baserun_sb + baserun_cs + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp_bi +
batting_1B + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP, data = transformed_rm)
Residuals:
Min 1Q Median 3Q Max
-2.56849 -0.53857 -0.00773 0.51266 2.60550
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.004146 0.016510 -0.251 0.801724
batting_h 0.062669 0.100623 0.623 0.533471
batting_2b -0.117553 0.042909 -2.740 0.006200 **
batting_3b 0.141785 0.039526 3.587 0.000342 ***
batting_bb 0.278455 0.058451 4.764 2.02e-06 ***
batting_so -0.318815 0.087236 -3.655 0.000263 ***
baserun_sb 0.144407 0.047105 3.066 0.002198 **
baserun_cs 0.156213 0.042597 3.667 0.000251 ***
pitching_hr -0.086129 0.077932 -1.105 0.269198
pitching_bb -0.287292 0.048809 -5.886 4.55e-09 ***
pitching_so 0.013010 0.075139 0.173 0.862557
fielding_e -0.606150 0.038131 -15.896 < 2e-16 ***
fielding_dp -0.200150 0.023391 -8.557 < 2e-16 ***
batting_hbp_bi -0.143371 0.021020 -6.821 1.16e-11 ***
batting_1B -0.051523 0.079544 -0.648 0.517223
total_bases 0.451258 0.089138 5.062 4.48e-07 ***
total_bases_allowed 0.179306 0.059328 3.022 0.002537 **
HR_over_OP -0.116654 0.036001 -3.240 0.001212 **
walks_over_OP 0.168838 0.049067 3.441 0.000590 ***
SO_over_OP 0.012120 0.037866 0.320 0.748930
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7853 on 2244 degrees of freedom
Multiple R-squared: 0.3753, Adjusted R-squared: 0.3701
F-statistic: 70.97 on 19 and 2244 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(trans_model_all_rm))
[1] "MSE equal 0.611243383828704"
trans_model_lp_rm <-
lm(target_wins ~ batting_3b + batting_bb + batting_so + baserun_sb + baserun_cs + pitching_bb + fielding_e + fielding_dp + batting_hbp_bi + total_bases + total_bases_allowed + walks_over_OP + SO_over_OP, data = transformed_rm)
par(mfrow = c(2, 2))
plot(trans_model_lp_rm)
summary(trans_model_lp_rm)
Call:
lm(formula = target_wins ~ batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_bb + fielding_e + fielding_dp +
batting_hbp_bi + total_bases + total_bases_allowed + walks_over_OP +
SO_over_OP, data = transformed_rm)
Residuals:
Min 1Q Median 3Q Max
-2.46363 -0.54355 -0.00617 0.51656 2.66998
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.003696 0.016558 -0.223 0.823381
batting_3b 0.192107 0.032017 6.000 2.29e-09 ***
batting_bb 0.301440 0.052999 5.688 1.46e-08 ***
batting_so -0.315611 0.030045 -10.505 < 2e-16 ***
baserun_sb 0.171528 0.038988 4.400 1.14e-05 ***
baserun_cs 0.146010 0.041517 3.517 0.000445 ***
pitching_bb -0.276402 0.048349 -5.717 1.23e-08 ***
fielding_e -0.587918 0.037208 -15.801 < 2e-16 ***
fielding_dp -0.200670 0.023188 -8.654 < 2e-16 ***
batting_hbp_bi -0.156260 0.019739 -7.916 3.81e-15 ***
total_bases 0.328699 0.040330 8.150 5.96e-16 ***
total_bases_allowed 0.225644 0.041771 5.402 7.29e-08 ***
walks_over_OP 0.146176 0.047327 3.089 0.002036 **
SO_over_OP 0.048142 0.033067 1.456 0.145559
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7876 on 2250 degrees of freedom
Multiple R-squared: 0.3699, Adjusted R-squared: 0.3663
F-statistic: 101.6 on 13 and 2250 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(trans_model_lp_rm))
[1] "MSE equal 0.616546372501848"
stepwise_base_model_bd_rm <- stepAIC(trans_model_all_rm, direction = "both")
Start: AIC=-1074.48
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb +
pitching_so + fielding_e + fielding_dp + batting_hbp_bi +
batting_1B + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP
Df Sum of Sq RSS AIC
- pitching_so 1 0.018 1383.9 -1076.45
- SO_over_OP 1 0.063 1383.9 -1076.37
- batting_h 1 0.239 1384.1 -1076.09
- batting_1B 1 0.259 1384.1 -1076.05
- pitching_hr 1 0.753 1384.6 -1075.24
<none> 1383.9 -1074.48
- batting_2b 1 4.629 1388.5 -1068.92
- total_bases_allowed 1 5.633 1389.5 -1067.28
- baserun_sb 1 5.796 1389.7 -1067.01
- HR_over_OP 1 6.475 1390.3 -1065.91
- walks_over_OP 1 7.302 1391.2 -1064.56
- batting_3b 1 7.935 1391.8 -1063.53
- batting_so 1 8.237 1392.1 -1063.04
- baserun_cs 1 8.293 1392.2 -1062.95
- batting_bb 1 13.996 1397.8 -1053.69
- total_bases 1 15.805 1399.7 -1050.77
- pitching_bb 1 21.365 1405.2 -1041.79
- batting_hbp_bi 1 28.690 1412.5 -1030.02
- fielding_dp 1 45.151 1429.0 -1003.79
- fielding_e 1 155.835 1539.7 -834.89
Step: AIC=-1076.45
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb +
fielding_e + fielding_dp + batting_hbp_bi + batting_1B +
total_bases + total_bases_allowed + HR_over_OP + walks_over_OP +
SO_over_OP
Df Sum of Sq RSS AIC
- SO_over_OP 1 0.096 1384.0 -1078.29
- batting_h 1 0.226 1384.1 -1078.08
- batting_1B 1 0.249 1384.1 -1078.04
- pitching_hr 1 0.753 1384.6 -1077.21
<none> 1383.9 -1076.45
+ pitching_so 1 0.018 1383.9 -1074.48
- batting_2b 1 4.643 1388.5 -1070.86
- baserun_sb 1 5.834 1389.7 -1068.92
- HR_over_OP 1 6.460 1390.3 -1067.90
- walks_over_OP 1 7.329 1391.2 -1066.49
- batting_3b 1 8.042 1391.9 -1065.33
- total_bases_allowed 1 8.469 1392.3 -1064.63
- baserun_cs 1 8.477 1392.3 -1064.62
- batting_bb 1 14.026 1397.9 -1055.62
- total_bases 1 16.330 1400.2 -1051.89
- pitching_bb 1 21.348 1405.2 -1043.79
- batting_hbp_bi 1 28.711 1412.6 -1031.96
- batting_so 1 37.451 1421.3 -1017.99
- fielding_dp 1 45.453 1429.3 -1005.28
- fielding_e 1 155.990 1539.9 -836.64
Step: AIC=-1078.29
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb +
fielding_e + fielding_dp + batting_hbp_bi + batting_1B +
total_bases + total_bases_allowed + HR_over_OP + walks_over_OP
Df Sum of Sq RSS AIC
- batting_h 1 0.206 1384.2 -1079.95
- batting_1B 1 0.224 1384.2 -1079.92
- pitching_hr 1 0.748 1384.7 -1079.07
<none> 1384.0 -1078.29
+ SO_over_OP 1 0.096 1383.9 -1076.45
+ pitching_so 1 0.051 1383.9 -1076.37
- batting_2b 1 4.563 1388.5 -1072.84
- baserun_sb 1 5.799 1389.8 -1070.82
- walks_over_OP 1 7.637 1391.6 -1067.83
- HR_over_OP 1 7.961 1391.9 -1067.30
- batting_3b 1 8.072 1392.0 -1067.12
- total_bases_allowed 1 8.861 1392.8 -1065.84
- baserun_cs 1 9.120 1393.1 -1065.42
- batting_bb 1 13.936 1397.9 -1057.61
- total_bases 1 16.235 1400.2 -1053.89
- pitching_bb 1 21.295 1405.3 -1045.72
- batting_hbp_bi 1 28.695 1412.7 -1033.83
- batting_so 1 39.495 1423.5 -1016.59
- fielding_dp 1 45.517 1429.5 -1007.03
- fielding_e 1 155.950 1539.9 -838.55
Step: AIC=-1079.95
target_wins ~ batting_2b + batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_hr + pitching_bb + fielding_e +
fielding_dp + batting_hbp_bi + batting_1B + total_bases +
total_bases_allowed + HR_over_OP + walks_over_OP
Df Sum of Sq RSS AIC
- batting_1B 1 0.025 1384.2 -1081.91
- pitching_hr 1 0.658 1384.8 -1080.88
<none> 1384.2 -1079.95
+ batting_h 1 0.206 1384.0 -1078.29
+ SO_over_OP 1 0.076 1384.1 -1078.08
+ pitching_so 1 0.025 1384.2 -1077.99
- batting_2b 1 5.393 1389.6 -1073.15
- baserun_sb 1 5.593 1389.8 -1072.82
- HR_over_OP 1 7.760 1391.9 -1069.29
- walks_over_OP 1 8.407 1392.6 -1068.24
- baserun_cs 1 9.161 1393.3 -1067.02
- batting_3b 1 10.021 1394.2 -1065.62
- total_bases_allowed 1 12.507 1396.7 -1061.59
- batting_bb 1 13.735 1397.9 -1059.60
- total_bases 1 19.547 1403.7 -1050.20
- pitching_bb 1 22.029 1406.2 -1046.21
- batting_hbp_bi 1 28.540 1412.7 -1035.75
- batting_so 1 39.366 1423.5 -1018.46
- fielding_dp 1 45.786 1430.0 -1008.28
- fielding_e 1 155.844 1540.0 -840.41
Step: AIC=-1081.91
target_wins ~ batting_2b + batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_hr + pitching_bb + fielding_e +
fielding_dp + batting_hbp_bi + total_bases + total_bases_allowed +
HR_over_OP + walks_over_OP
Df Sum of Sq RSS AIC
- pitching_hr 1 0.798 1385.0 -1082.61
<none> 1384.2 -1081.91
+ SO_over_OP 1 0.071 1384.1 -1080.03
+ pitching_so 1 0.027 1384.2 -1079.96
+ batting_1B 1 0.025 1384.2 -1079.95
+ batting_h 1 0.007 1384.2 -1079.92
- batting_2b 1 5.979 1390.2 -1074.15
- baserun_sb 1 6.583 1390.8 -1073.17
- HR_over_OP 1 7.902 1392.1 -1071.02
- walks_over_OP 1 8.385 1392.6 -1070.24
- baserun_cs 1 9.333 1393.5 -1068.70
- batting_3b 1 11.256 1395.5 -1065.58
- total_bases_allowed 1 12.765 1397.0 -1063.13
- batting_bb 1 15.311 1399.5 -1059.01
- pitching_bb 1 22.024 1406.2 -1048.17
- batting_hbp_bi 1 30.199 1414.4 -1035.05
- total_bases 1 30.243 1414.4 -1034.98
- batting_so 1 41.737 1425.9 -1016.66
- fielding_dp 1 46.748 1431.0 -1008.71
- fielding_e 1 156.852 1541.0 -840.89
Step: AIC=-1082.61
target_wins ~ batting_2b + batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_bb + fielding_e + fielding_dp +
batting_hbp_bi + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP
Df Sum of Sq RSS AIC
<none> 1385.0 -1082.61
+ pitching_hr 1 0.798 1384.2 -1081.91
+ batting_h 1 0.262 1384.7 -1081.04
+ batting_1B 1 0.165 1384.8 -1080.88
+ SO_over_OP 1 0.090 1384.9 -1080.76
+ pitching_so 1 0.026 1385.0 -1080.65
- batting_2b 1 5.350 1390.3 -1075.88
- HR_over_OP 1 7.157 1392.2 -1072.94
- walks_over_OP 1 7.840 1392.8 -1071.83
- baserun_cs 1 9.159 1394.2 -1069.69
- baserun_sb 1 10.689 1395.7 -1067.20
- total_bases_allowed 1 11.977 1397.0 -1065.11
- batting_3b 1 16.760 1401.8 -1057.38
- batting_bb 1 17.679 1402.7 -1055.89
- pitching_bb 1 21.807 1406.8 -1049.24
- batting_hbp_bi 1 29.520 1414.5 -1036.86
- total_bases 1 44.500 1429.5 -1013.01
- fielding_dp 1 46.784 1431.8 -1009.40
- batting_so 1 78.285 1463.3 -960.12
- fielding_e 1 158.069 1543.1 -839.93
par(mfrow = c(2, 2))
plot(stepwise_base_model_bd_rm)
summary(stepwise_base_model_bd_rm)
Call:
lm(formula = target_wins ~ batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_bb + fielding_e +
fielding_dp + batting_hbp_bi + total_bases + total_bases_allowed +
HR_over_OP + walks_over_OP, data = transformed_rm)
Residuals:
Min 1Q Median 3Q Max
-2.57248 -0.53547 -0.00539 0.51609 2.59983
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.004081 0.016497 -0.247 0.804617
batting_2b -0.079660 0.027025 -2.948 0.003236 **
batting_3b 0.168995 0.032394 5.217 1.99e-07 ***
batting_bb 0.288103 0.053772 5.358 9.28e-08 ***
batting_so -0.323597 0.028701 -11.275 < 2e-16 ***
baserun_sb 0.163636 0.039278 4.166 3.22e-05 ***
baserun_cs 0.157213 0.040766 3.856 0.000118 ***
pitching_bb -0.287537 0.048320 -5.951 3.09e-09 ***
fielding_e -0.597365 0.037286 -16.021 < 2e-16 ***
fielding_dp -0.201650 0.023136 -8.716 < 2e-16 ***
batting_hbp_bi -0.139175 0.020102 -6.924 5.72e-12 ***
total_bases 0.401544 0.047237 8.501 < 2e-16 ***
total_bases_allowed 0.187277 0.042466 4.410 1.08e-05 ***
HR_over_OP -0.102766 0.030146 -3.409 0.000664 ***
walks_over_OP 0.159923 0.044820 3.568 0.000367 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7847 on 2249 degrees of freedom
Multiple R-squared: 0.3748, Adjusted R-squared: 0.3709
F-statistic: 96.31 on 14 and 2249 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(stepwise_base_model_bd_rm))
[1] "MSE equal 0.611748149347107"
stepwise_base_model_fw_rm <- stepAIC(trans_model_all_rm, direction = "forward")
Start: AIC=-1074.48
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb +
pitching_so + fielding_e + fielding_dp + batting_hbp_bi +
batting_1B + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP
par(mfrow = c(2, 2))
plot(stepwise_base_model_fw_rm)
summary(stepwise_base_model_fw_rm)
Call:
lm(formula = target_wins ~ batting_h + batting_2b + batting_3b +
batting_bb + batting_so + baserun_sb + baserun_cs + pitching_hr +
pitching_bb + pitching_so + fielding_e + fielding_dp + batting_hbp_bi +
batting_1B + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP, data = transformed_rm)
Residuals:
Min 1Q Median 3Q Max
-2.56849 -0.53857 -0.00773 0.51266 2.60550
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.004146 0.016510 -0.251 0.801724
batting_h 0.062669 0.100623 0.623 0.533471
batting_2b -0.117553 0.042909 -2.740 0.006200 **
batting_3b 0.141785 0.039526 3.587 0.000342 ***
batting_bb 0.278455 0.058451 4.764 2.02e-06 ***
batting_so -0.318815 0.087236 -3.655 0.000263 ***
baserun_sb 0.144407 0.047105 3.066 0.002198 **
baserun_cs 0.156213 0.042597 3.667 0.000251 ***
pitching_hr -0.086129 0.077932 -1.105 0.269198
pitching_bb -0.287292 0.048809 -5.886 4.55e-09 ***
pitching_so 0.013010 0.075139 0.173 0.862557
fielding_e -0.606150 0.038131 -15.896 < 2e-16 ***
fielding_dp -0.200150 0.023391 -8.557 < 2e-16 ***
batting_hbp_bi -0.143371 0.021020 -6.821 1.16e-11 ***
batting_1B -0.051523 0.079544 -0.648 0.517223
total_bases 0.451258 0.089138 5.062 4.48e-07 ***
total_bases_allowed 0.179306 0.059328 3.022 0.002537 **
HR_over_OP -0.116654 0.036001 -3.240 0.001212 **
walks_over_OP 0.168838 0.049067 3.441 0.000590 ***
SO_over_OP 0.012120 0.037866 0.320 0.748930
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7853 on 2244 degrees of freedom
Multiple R-squared: 0.3753, Adjusted R-squared: 0.3701
F-statistic: 70.97 on 19 and 2244 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(stepwise_base_model_fw_rm))
[1] "MSE equal 0.611243383828704"
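The forward run above starts from the full model, so stepAIC has nothing left to add and returns that model unchanged (the summary lists all 19 predictors). A minimal sketch of forward selection that starts from an intercept-only model and supplies a scope is shown below. It uses the built-in mtcars data purely as a stand-in for transformed_rm, and the names null_model, full_model and forward_fit are hypothetical.

```r
library(MASS)  # provides stepAIC

# Illustrative data only: mtcars stands in for transformed_rm
null_model <- lm(mpg ~ 1, data = mtcars)  # intercept-only starting point
full_model <- lm(mpg ~ ., data = mtcars)  # largest model the search may reach

# Forward selection adds one term at a time, as long as AIC improves
forward_fit <- stepAIC(
  null_model,
  scope = list(lower = formula(null_model), upper = formula(full_model)),
  direction = "forward",
  trace = FALSE  # suppress the step-by-step table
)

formula(forward_fit)
```

With a scope like this, the forward result can differ from the backward and bidirectional results, which makes the three runs a more informative comparison.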
stepwise_base_model_bw_rm <- stepAIC(trans_model_all_rm, direction = "backward")
Start: AIC=-1074.48
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb +
pitching_so + fielding_e + fielding_dp + batting_hbp_bi +
batting_1B + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP + SO_over_OP
Df Sum of Sq RSS AIC
- pitching_so 1 0.018 1383.9 -1076.45
- SO_over_OP 1 0.063 1383.9 -1076.37
- batting_h 1 0.239 1384.1 -1076.09
- batting_1B 1 0.259 1384.1 -1076.05
- pitching_hr 1 0.753 1384.6 -1075.24
<none> 1383.9 -1074.48
- batting_2b 1 4.629 1388.5 -1068.92
- total_bases_allowed 1 5.633 1389.5 -1067.28
- baserun_sb 1 5.796 1389.7 -1067.01
- HR_over_OP 1 6.475 1390.3 -1065.91
- walks_over_OP 1 7.302 1391.2 -1064.56
- batting_3b 1 7.935 1391.8 -1063.53
- batting_so 1 8.237 1392.1 -1063.04
- baserun_cs 1 8.293 1392.2 -1062.95
- batting_bb 1 13.996 1397.8 -1053.69
- total_bases 1 15.805 1399.7 -1050.77
- pitching_bb 1 21.365 1405.2 -1041.79
- batting_hbp_bi 1 28.690 1412.5 -1030.02
- fielding_dp 1 45.151 1429.0 -1003.79
- fielding_e 1 155.835 1539.7 -834.89
Step: AIC=-1076.45
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb +
fielding_e + fielding_dp + batting_hbp_bi + batting_1B +
total_bases + total_bases_allowed + HR_over_OP + walks_over_OP +
SO_over_OP
Df Sum of Sq RSS AIC
- SO_over_OP 1 0.096 1384.0 -1078.29
- batting_h 1 0.226 1384.1 -1078.08
- batting_1B 1 0.249 1384.1 -1078.04
- pitching_hr 1 0.753 1384.6 -1077.21
<none> 1383.9 -1076.45
- batting_2b 1 4.643 1388.5 -1070.86
- baserun_sb 1 5.834 1389.7 -1068.92
- HR_over_OP 1 6.460 1390.3 -1067.90
- walks_over_OP 1 7.329 1391.2 -1066.49
- batting_3b 1 8.042 1391.9 -1065.33
- total_bases_allowed 1 8.469 1392.3 -1064.63
- baserun_cs 1 8.477 1392.3 -1064.62
- batting_bb 1 14.026 1397.9 -1055.62
- total_bases 1 16.330 1400.2 -1051.89
- pitching_bb 1 21.348 1405.2 -1043.79
- batting_hbp_bi 1 28.711 1412.6 -1031.96
- batting_so 1 37.451 1421.3 -1017.99
- fielding_dp 1 45.453 1429.3 -1005.28
- fielding_e 1 155.990 1539.9 -836.64
Step: AIC=-1078.29
target_wins ~ batting_h + batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_hr + pitching_bb +
fielding_e + fielding_dp + batting_hbp_bi + batting_1B +
total_bases + total_bases_allowed + HR_over_OP + walks_over_OP
Df Sum of Sq RSS AIC
- batting_h 1 0.206 1384.2 -1079.95
- batting_1B 1 0.224 1384.2 -1079.92
- pitching_hr 1 0.748 1384.7 -1079.07
<none> 1384.0 -1078.29
- batting_2b 1 4.563 1388.5 -1072.84
- baserun_sb 1 5.799 1389.8 -1070.82
- walks_over_OP 1 7.637 1391.6 -1067.83
- HR_over_OP 1 7.961 1391.9 -1067.30
- batting_3b 1 8.072 1392.0 -1067.12
- total_bases_allowed 1 8.861 1392.8 -1065.84
- baserun_cs 1 9.120 1393.1 -1065.42
- batting_bb 1 13.936 1397.9 -1057.61
- total_bases 1 16.235 1400.2 -1053.89
- pitching_bb 1 21.295 1405.3 -1045.72
- batting_hbp_bi 1 28.695 1412.7 -1033.83
- batting_so 1 39.495 1423.5 -1016.59
- fielding_dp 1 45.517 1429.5 -1007.03
- fielding_e 1 155.950 1539.9 -838.55
Step: AIC=-1079.95
target_wins ~ batting_2b + batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_hr + pitching_bb + fielding_e +
fielding_dp + batting_hbp_bi + batting_1B + total_bases +
total_bases_allowed + HR_over_OP + walks_over_OP
Df Sum of Sq RSS AIC
- batting_1B 1 0.025 1384.2 -1081.91
- pitching_hr 1 0.658 1384.8 -1080.88
<none> 1384.2 -1079.95
- batting_2b 1 5.393 1389.6 -1073.15
- baserun_sb 1 5.593 1389.8 -1072.82
- HR_over_OP 1 7.760 1391.9 -1069.29
- walks_over_OP 1 8.407 1392.6 -1068.24
- baserun_cs 1 9.161 1393.3 -1067.02
- batting_3b 1 10.021 1394.2 -1065.62
- total_bases_allowed 1 12.507 1396.7 -1061.59
- batting_bb 1 13.735 1397.9 -1059.60
- total_bases 1 19.547 1403.7 -1050.20
- pitching_bb 1 22.029 1406.2 -1046.21
- batting_hbp_bi 1 28.540 1412.7 -1035.75
- batting_so 1 39.366 1423.5 -1018.46
- fielding_dp 1 45.786 1430.0 -1008.28
- fielding_e 1 155.844 1540.0 -840.41
Step: AIC=-1081.91
target_wins ~ batting_2b + batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_hr + pitching_bb + fielding_e +
fielding_dp + batting_hbp_bi + total_bases + total_bases_allowed +
HR_over_OP + walks_over_OP
Df Sum of Sq RSS AIC
- pitching_hr 1 0.798 1385.0 -1082.61
<none> 1384.2 -1081.91
- batting_2b 1 5.979 1390.2 -1074.15
- baserun_sb 1 6.583 1390.8 -1073.17
- HR_over_OP 1 7.902 1392.1 -1071.02
- walks_over_OP 1 8.385 1392.6 -1070.24
- baserun_cs 1 9.333 1393.5 -1068.70
- batting_3b 1 11.256 1395.5 -1065.58
- total_bases_allowed 1 12.765 1397.0 -1063.13
- batting_bb 1 15.311 1399.5 -1059.01
- pitching_bb 1 22.024 1406.2 -1048.17
- batting_hbp_bi 1 30.199 1414.4 -1035.05
- total_bases 1 30.243 1414.4 -1034.98
- batting_so 1 41.737 1425.9 -1016.66
- fielding_dp 1 46.748 1431.0 -1008.71
- fielding_e 1 156.852 1541.0 -840.89
Step: AIC=-1082.61
target_wins ~ batting_2b + batting_3b + batting_bb + batting_so +
baserun_sb + baserun_cs + pitching_bb + fielding_e + fielding_dp +
batting_hbp_bi + total_bases + total_bases_allowed + HR_over_OP +
walks_over_OP
Df Sum of Sq RSS AIC
<none> 1385.0 -1082.61
- batting_2b 1 5.350 1390.3 -1075.88
- HR_over_OP 1 7.157 1392.2 -1072.94
- walks_over_OP 1 7.840 1392.8 -1071.83
- baserun_cs 1 9.159 1394.2 -1069.69
- baserun_sb 1 10.689 1395.7 -1067.20
- total_bases_allowed 1 11.977 1397.0 -1065.11
- batting_3b 1 16.760 1401.8 -1057.38
- batting_bb 1 17.679 1402.7 -1055.89
- pitching_bb 1 21.807 1406.8 -1049.24
- batting_hbp_bi 1 29.520 1414.5 -1036.86
- total_bases 1 44.500 1429.5 -1013.01
- fielding_dp 1 46.784 1431.8 -1009.40
- batting_so 1 78.285 1463.3 -960.12
- fielding_e 1 158.069 1543.1 -839.93
par(mfrow = c(2, 2))
plot(stepwise_base_model_bw_rm)
summary(stepwise_base_model_bw_rm)
Call:
lm(formula = target_wins ~ batting_2b + batting_3b + batting_bb +
batting_so + baserun_sb + baserun_cs + pitching_bb + fielding_e +
fielding_dp + batting_hbp_bi + total_bases + total_bases_allowed +
HR_over_OP + walks_over_OP, data = transformed_rm)
Residuals:
Min 1Q Median 3Q Max
-2.57248 -0.53547 -0.00539 0.51609 2.59983
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.004081 0.016497 -0.247 0.804617
batting_2b -0.079660 0.027025 -2.948 0.003236 **
batting_3b 0.168995 0.032394 5.217 1.99e-07 ***
batting_bb 0.288103 0.053772 5.358 9.28e-08 ***
batting_so -0.323597 0.028701 -11.275 < 2e-16 ***
baserun_sb 0.163636 0.039278 4.166 3.22e-05 ***
baserun_cs 0.157213 0.040766 3.856 0.000118 ***
pitching_bb -0.287537 0.048320 -5.951 3.09e-09 ***
fielding_e -0.597365 0.037286 -16.021 < 2e-16 ***
fielding_dp -0.201650 0.023136 -8.716 < 2e-16 ***
batting_hbp_bi -0.139175 0.020102 -6.924 5.72e-12 ***
total_bases 0.401544 0.047237 8.501 < 2e-16 ***
total_bases_allowed 0.187277 0.042466 4.410 1.08e-05 ***
HR_over_OP -0.102766 0.030146 -3.409 0.000664 ***
walks_over_OP 0.159923 0.044820 3.568 0.000367 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7847 on 2249 degrees of freedom
Multiple R-squared: 0.3748, Adjusted R-squared: 0.3709
F-statistic: 96.31 on 14 and 2249 DF, p-value: < 2.2e-16
paste('MSE equal ', mse(stepwise_base_model_bw_rm))
[1] "MSE equal 0.611748149347107"
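The adjusted R-squared, AIC and MSE printed for each fit above are easier to compare side by side. The sketch below collects them into one table. It is illustrative only: the mse helper is a hypothetical stand-in defined as the mean squared residual (the report's own mse function is not shown in this section), and the two models are fitted on the built-in mtcars data rather than on transformed_rm.

```r
# Hypothetical stand-in for the report's mse() helper: mean squared residual
mse <- function(model) mean(residuals(model)^2)

# Illustrative candidate models on built-in data
models <- list(
  full  = lm(mpg ~ wt + hp + qsec + drat, data = mtcars),
  small = lm(mpg ~ wt + hp, data = mtcars)
)

# One row per model, with the metrics used in this report
comparison <- data.frame(
  model  = names(models),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  aic    = sapply(models, AIC),
  mse    = sapply(models, mse),
  row.names = NULL
)

comparison[order(comparison$aic), ]  # lowest AIC first
```

Note that AIC() includes a constant that the stepAIC trace omits, so its values differ from those in the step tables, but the ranking of models fitted to the same data is the same.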
Applying the transformation definitely made a difference. The only problem I faced was that my predictions went into the thousands when I used the models built on the transformed data set. After paying close attention to the Cook’s distance of the observations in each model, I removed certain influential observations, which led to an improved model.
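The Cook's distance screening described above can be sketched as follows. This is a minimal illustration on the built-in mtcars data, not the exact procedure used in this report: the 4/n cutoff is a common rule of thumb and an assumption here, and the names fit, trimmed and refit are hypothetical.

```r
# Fit an illustrative model on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance measures how much each observation shifts the fitted values
cooks <- cooks.distance(fit)

# Assumed cutoff: the 4/n rule of thumb (the report does not state its cutoff)
threshold <- 4 / nrow(mtcars)
influential <- names(cooks)[cooks > threshold]

# Drop the flagged rows and refit the same formula
trimmed <- mtcars[!(rownames(mtcars) %in% influential), ]
refit <- lm(mpg ~ wt + hp, data = trimmed)

summary(refit)$adj.r.squared
```

Removing observations this way changes the training data, so the refitted model should be compared with the original on held-out data rather than on the trimmed set alone.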
After testing more than 10 models, using different techniques and transformations, I settled on a model built after I capped outliers, removed variables causing multicollinearity, removed variables that were not statistically significant (high p-values), and removed influential observations.
Here is the model base_model_lp_rm:
Target Wins = 32.157432
- 0.035903 * moneyball_imp_test$batting_2b
+ 0.068862 * moneyball_imp_test$batting_3b
+ 0.044466 * moneyball_imp_test$batting_bb
- 0.016966 * moneyball_imp_test$batting_so
+ 0.060647 * moneyball_imp_test$baserun_sb
- 0.050230 * moneyball_imp_test$pitching_bb
- 0.043364 * moneyball_imp_test$fielding_e
- 0.105258 * moneyball_imp_test$fielding_dp
- 4.089404 * moneyball_imp_test$batting_hbp_bi
+ 0.021326 * moneyball_imp_test$total_bases
+ 0.011782 * moneyball_imp_test$total_bases_allowed
+ 0.023997 * moneyball_imp_test$walks_over_OP
+ 0.008021 * moneyball_imp_test$SO_over_OP
Judged by R-squared and adjusted R-squared together with the residual plots, base_model_lp_rm was not the best model. The stepwise models fitted after removing influential observations were the best, but when applied to the test data set their predictions were in the thousands. I may have missed a step, but base_model_lp_rm will be my final model.
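One possible explanation for predictions in the thousands is a scale mismatch: a model fitted on standardized variables is given raw test values, or its predictions are not mapped back to the original units. This is only an assumption; the near-zero intercepts of the transformed models suggest centering and scaling, but this section does not confirm it. The sketch below shows the idea on the built-in mtcars data, with hypothetical names throughout: the centering and scaling parameters are learned on the training rows, reused on the test rows, and the predictions are converted back to the original scale.

```r
set.seed(1)  # reproducible illustrative split

vars  <- c("mpg", "wt", "hp")
idx   <- sample(nrow(mtcars), 24)
train <- mtcars[idx, vars]
test  <- mtcars[-idx, vars]

# Learn centering and scaling parameters from the training rows only
centers <- colMeans(train)
scales  <- apply(train, 2, sd)

# Apply the same parameters to both sets
train_z <- as.data.frame(scale(train, center = centers, scale = scales))
test_z  <- as.data.frame(scale(test,  center = centers, scale = scales))

fit_z <- lm(mpg ~ wt + hp, data = train_z)

# Predictions come out in standardized units; map them back to mpg
pred_z <- predict(fit_z, newdata = test_z)
pred   <- pred_z * scales[["mpg"]] + centers[["mpg"]]

range(pred)
```

If the back-transformed predictions land in the same range as the observed target, the scale handling is consistent; values far outside that range usually point to test data that skipped the transformation.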